<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>A Statistician&#39;s R Notebook</title>
<link>https://mfatihtuzen.github.io/</link>
<atom:link href="https://mfatihtuzen.github.io/index.xml" rel="self" type="application/rss+xml"/>
<description>Blog posts about R and Statistics</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Fri, 24 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Publishing a Quarto Blog: What I Learned Moving from Netlify to GitHub Pages</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2026-04-24_quarto_blog_github/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-04-24_quarto_blog_github/netlify_github.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>Quarto makes it surprisingly easy to build a blog.</p>
<p>You write your content, render it, and publish it. Everything works—until it doesn’t.</p>
<p>More broadly, Quarto turns plain text files into modern technical websites, blogs, books, and reports. A typical Quarto website can combine narrative text, executable code, figures, tables, references, and multiple output formats in a single reproducible publishing workflow. In that sense, Quarto is not only a writing tool; it is also a publishing system designed especially for computational and data-driven content. The official Quarto documentation describes websites as projects that can be rendered and published to several destinations, including GitHub Pages, Netlify, Posit Connect, and other static hosting services <span class="citation" data-cites="quartoWebsites quartoPublishing">(Posit PBC 2026a, 2026b)</span>.</p>
<p>For someone writing about R, statistics, or data science, this is very attractive. You can write a blog post in <code>.qmd</code>, run your R code inside the document, generate plots and tables, render the site locally, and then publish the resulting static files. At first glance, the workflow looks almost linear:</p>
<ol type="1">
<li>write the content,</li>
<li>render the site,</li>
<li>deploy it,</li>
<li>share the link.</li>
</ol>
<p>Many introductory tutorials understandably focus on this smooth path. They explain how to create a Quarto website, configure the <code>_quarto.yml</code> file, add posts, render the project, and publish the site. These steps are necessary, but they do not fully describe what happens when a Quarto blog becomes a living project rather than a one-time demo.</p>
<p>The real questions usually appear later. What happens when the site grows? What happens when posts include code, external data sources, generated images, downloadable files, or multiple output formats? What happens when the website builds successfully on your own computer but fails in the deployment environment? At that point, publishing is no longer just about pushing HTML files to the web. It becomes a question of reproducibility, dependency management, build strategy, and platform choice.</p>
<p>This article reflects on that second stage: the stage where a Quarto blog moves from a local project to a maintained public website. More specifically, it discusses the practical lessons learned while moving a Quarto-based blog from Netlify to GitHub Pages. The aim is not to provide another “click here, then click there” tutorial. Instead, the goal is to discuss the kinds of issues that are often invisible at the beginning: build limits, environment differences, hidden dependencies, external services, file paths, output formats, and the trade-offs between convenience and control.</p>
<p>In short, this is a real-world deployment story. Not because the technical details are unique, but because the pattern is common: a tool works beautifully in local development, then the publishing pipeline reveals the assumptions we did not know we were making.</p>
<hr>
</section>
<section id="when-things-start-to-break" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="when-things-start-to-break"><span class="header-section-number">2</span> When Things Start to Break</h2>
<p>Like many users, I initially chose Netlify as my deployment platform. It is fast, easy to configure, and works very well for traditional static websites. With minimal setup, it is possible to connect a repository, trigger automatic builds, and publish a site within minutes. For simple blogs and documentation pages, this model is both convenient and efficient.</p>
<p>For a while, everything worked smoothly.</p>
<p>However, as the project evolved, the nature of the website also started to change. What initially looked like a static blog gradually became a more dynamic, computation-driven project. Posts were no longer just text; they included code execution, data processing, and generated outputs such as figures, tables, and downloadable files.</p>
<p>At this point, some structural limitations of build-based deployment started to become more visible.</p>
<p>First, a deployment is often a full rebuild. Depending on the configuration, even a small change can trigger a complete build process. While this is not an issue for lightweight static content, it becomes more significant for projects that rely on computation.</p>
<p>Second, data-driven Quarto projects are inherently heavier than typical static sites. Rendering a post may involve running R code, loading libraries, generating plots, or even accessing external data sources. These steps increase both build time and resource usage.</p>
<p>Third, frequent updates amplify the effect. A workflow that feels fast at the beginning can become noticeably slower as the number of posts grows and the project becomes more complex. Over time, this can translate into longer build durations and increased consumption of available resources.</p>
<p>None of these are “failures” in the strict sense. They are natural consequences of using a system designed primarily for static content in a context that increasingly behaves like a computational workflow.</p>
<p>At this stage, the central question was no longer:</p>
<blockquote class="blockquote">
<p><em>How do I deploy this site?</em></p>
</blockquote>
<p>but rather:</p>
<blockquote class="blockquote">
<p><em>Is this deployment model sustainable for a data-driven Quarto project in the long run?</em></p>
</blockquote>
</section>
<section id="moving-to-github-pages" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="moving-to-github-pages"><span class="header-section-number">3</span> Moving to GitHub Pages</h2>
<p>At this point, the decision to explore alternatives was not driven by a single failure, but by a growing mismatch between the project’s needs and the deployment model.</p>
<p>GitHub Pages emerged as a natural alternative.</p>
<p>Unlike platforms that rely on external build services, GitHub Pages is closely integrated with the repository itself. This creates a different workflow: instead of delegating the entire process to a managed service, the developer has more direct control over how the site is built and deployed.</p>
<p>This shift might seem subtle, but it changes the way you think about publishing.</p>
<p>In a repository-driven approach, the website is no longer just an output. It becomes part of a controlled pipeline:</p>
<ul>
<li>the source files are versioned,</li>
<li>the build process is explicitly defined,</li>
<li>and the output is reproducible under the same conditions.</li>
</ul>
<p>This level of control is particularly important for projects that include code execution and data processing. When rendering depends on computations, it becomes essential to understand <em>how</em> and <em>where</em> those computations are performed.</p>
<p>Another important difference is transparency. Build logs, dependency resolution, and execution steps are visible and traceable. While this may introduce additional complexity at first, it also makes debugging and long-term maintenance significantly easier.</p>
<p>Of course, this approach comes with a trade-off.</p>
<p>Compared to Netlify, GitHub Pages requires a bit more effort to set up and maintain. It is less “plug-and-play” and more “build-your-own-pipeline.” However, for projects that go beyond simple static content, this added responsibility often translates into greater flexibility.</p>
<p>In that sense, the transition was not just about switching platforms. It was about moving from a convenience-oriented model to a control-oriented one.</p>
<p>And that shift becomes especially meaningful once the project starts to grow.</p>
</section>
<section id="what-you-dont-see-in-tutorials" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="what-you-dont-see-in-tutorials"><span class="header-section-number">4</span> What You Don’t See in Tutorials</h2>
<p>Most tutorials focus on the ideal path: everything works, the site renders, and deployment succeeds. While this is useful for getting started, it often hides an important reality.</p>
<p>As soon as a project moves beyond a simple example, a different set of challenges begins to emerge—challenges that are rarely discussed in introductory guides.</p>
<section id="environment-differences" class="level3" data-number="4.1">
<h3 data-number="4.1" class="anchored" data-anchor-id="environment-differences"><span class="header-section-number">4.1</span> Environment Differences</h3>
<p>One of the first realizations is that the local environment and the deployment environment are fundamentally different.</p>
<p>A project that works perfectly on a personal machine may fail when executed elsewhere. Differences in operating systems, available libraries, or system configurations can lead to unexpected behavior.</p>
<blockquote class="blockquote">
<p>If it works locally, it only proves one thing: it works locally.</p>
</blockquote>
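<p>One low-cost way to make such differences visible is to print a short description of the environment at the start of a render, so that the local log and the deployment log can be compared side by side. The sketch below uses only base R. The helper name <code>describe_environment()</code> is hypothetical; it is not part of Quarto or of this blog's actual setup.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: collect a few facts that often differ between machines.
describe_environment &lt;- function() {
  list(
    r_version = R.version.string,            # R release in use
    platform  = R.version$platform,          # CPU / operating system triple
    os        = Sys.info()[["sysname"]],     # "Linux", "Windows", "Darwin", ...
    locale    = Sys.getlocale("LC_COLLATE"), # affects sorting and string comparison
    timezone  = Sys.timezone(),
    workdir   = getwd()
  )
}

env &lt;- describe_environment()
str(env)</code></pre></div>
</div>
<p>Running the same call locally and in the deployment job gives two small listings that can be compared line by line when a build behaves differently.</p>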
<hr>
</section>
<section id="dependency-management" class="level3" data-number="4.2">
<h3 data-number="4.2" class="anchored" data-anchor-id="dependency-management"><span class="header-section-number">4.2</span> Dependency Management</h3>
<p>Dependencies are not always as explicit as they seem. Even when a project appears to rely on a small set of libraries, there are often additional layers:</p>
<ul>
<li>indirect dependencies</li>
<li>optional components</li>
<li>version-specific behaviors</li>
</ul>
<p>These hidden relationships can make a project fragile when moved across environments.</p>
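<p>To get a rough sense of how deep these layers go, base R can list the recursive dependencies of a package from the set installed on the current machine. This is a minimal sketch: <code>"stats"</code> is only a placeholder for a package your posts actually load, and the result describes this machine's library, not the deployment environment.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># List everything a package pulls in, directly or indirectly, using the
# locally installed packages as the lookup table (no network access needed).
pkg &lt;- "stats"  # placeholder: replace with a package your posts load

deps &lt;- tools::package_dependencies(
  pkg,
  db        = installed.packages(),
  which     = c("Depends", "Imports", "LinkingTo"),
  recursive = TRUE
)

deps[[pkg]]          # character vector of direct and indirect dependencies
length(deps[[pkg]])  # how many packages stand behind a single library() call</code></pre></div>
</div>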
<hr>
</section>
<section id="system-level-requirements" class="level3" data-number="4.3">
<h3 data-number="4.3" class="anchored" data-anchor-id="system-level-requirements"><span class="header-section-number">4.3</span> System-Level Requirements</h3>
<p>Not all requirements are defined within the project itself. Some dependencies exist at the system level, especially for:</p>
<ul>
<li>graphics rendering</li>
<li>font handling</li>
<li>data processing backends</li>
</ul>
<p>These are often invisible during development but become critical during deployment, particularly in clean or minimal environments.</p>
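<p>Some of these system-level features can be inspected from inside R. The sketch below checks a few graphics capabilities and prints the versions of external libraries that the current R build is linked against. The choice of <code>png</code>, <code>jpeg</code>, and <code>cairo</code> is an assumption made for illustration; check whichever capabilities your own posts rely on.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Which graphics back ends was this R build compiled with?
caps &lt;- capabilities(c("png", "jpeg", "cairo"))
caps

# Versions of external libraries R is linked against (zlib, PCRE, ICU, ...)
extSoftVersion()

# Report anything that is missing; which() ignores NA entries
missing_caps &lt;- names(which(!caps))
if (length(missing_caps) &gt; 0) {
  message("Not available in this R build: ", paste(missing_caps, collapse = ", "))
}</code></pre></div>
</div>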
<hr>
</section>
<section id="file-and-path-handling" class="level3" data-number="4.4">
<h3 data-number="4.4" class="anchored" data-anchor-id="file-and-path-handling"><span class="header-section-number">4.4</span> File and Path Handling</h3>
<p>File handling is more sensitive than it appears. Paths that work locally may fail in another environment due to:</p>
<ul>
<li>differences in working directories</li>
<li>case sensitivity in file systems</li>
<li>missing intermediate outputs</li>
</ul>
<p>Even small assumptions about file locations can introduce subtle but impactful errors.</p>
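<p>The case-sensitivity problem in particular can be caught before deployment. The sketch below defines a hypothetical helper, <code>exists_exact_case()</code>, which accepts a path only if the file name matches the spelling stored on disk. It checks the final path component only, not the directories above it.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: TRUE only if `path` exists with exactly this spelling.
# On a case-insensitive file system, file.exists() alone would accept
# "Data.csv" for a file stored as "data.csv".
exists_exact_case &lt;- function(path) {
  file.exists(path) &amp;&amp;
    basename(path) %in% list.files(dirname(path), all.files = TRUE)
}

# Demonstration in a temporary directory
tmp &lt;- tempfile("paths_")
dir.create(tmp)
writeLines("x,y\n1,2", file.path(tmp, "data.csv"))

exists_exact_case(file.path(tmp, "data.csv"))  # TRUE
exists_exact_case(file.path(tmp, "Data.csv"))  # FALSE, regardless of file system</code></pre></div>
</div>
<p>A check like this behaves the same on a case-insensitive laptop and on a case-sensitive build server, which is the point: the mismatch shows up locally instead of during deployment.</p>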
<hr>
</section>
<section id="external-data-sources" class="level3" data-number="4.5">
<h3 data-number="4.5" class="anchored" data-anchor-id="external-data-sources"><span class="header-section-number">4.5</span> External Data Sources</h3>
<p>Using external data sources introduces another layer of uncertainty.</p>
<p>While integrating APIs or remote datasets is convenient, it also creates dependencies on factors outside the project’s control:</p>
<ul>
<li>network availability</li>
<li>response times</li>
<li>service stability</li>
</ul>
<blockquote class="blockquote">
<p>Every external dependency is a potential failure point.</p>
</blockquote>
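<p>One common way to soften this risk is to keep a cached copy of the last successful download and fall back to it when the live request fails. The sketch below illustrates the idea with base R only. The helper <code>read_with_fallback()</code> and the two stand-in functions <code>ok()</code> and <code>broken()</code> are hypothetical; in a real post, <code>fetch</code> would wrap the actual API call.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: try a live fetch, fall back to a cached copy on failure.
# `fetch` is any zero-argument function that returns a data frame.
read_with_fallback &lt;- function(fetch, cache_file) {
  tryCatch(
    {
      data &lt;- fetch()
      saveRDS(data, cache_file)   # refresh the cache after a successful fetch
      data
    },
    error = function(e) {
      if (!file.exists(cache_file)) stop(e)  # nothing cached: surface the error
      warning("Live fetch failed; using cached copy: ", conditionMessage(e))
      readRDS(cache_file)
    }
  )
}

cache  &lt;- tempfile(fileext = ".rds")
ok     &lt;- function() data.frame(x = 1:3)       # stands in for a working service
broken &lt;- function() stop("service unavailable") # stands in for an outage

first  &lt;- read_with_fallback(ok, cache)                       # fetches and caches
second &lt;- suppressWarnings(read_with_fallback(broken, cache)) # served from cache
all.equal(first, second)</code></pre></div>
</div>
<p>The trade-off is that a render can silently succeed with stale data, so the warning should stay visible in the build log.</p>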
<hr>
</section>
<section id="output-complexity" class="level3" data-number="4.6">
<h3 data-number="4.6" class="anchored" data-anchor-id="output-complexity"><span class="header-section-number">4.6</span> Output Complexity</h3>
<p>Supporting multiple output formats can significantly increase complexity. While HTML is typically straightforward, additional formats may require:</p>
<ul>
<li>extra tools</li>
<li>additional configuration</li>
<li>longer build processes</li>
</ul>
<p>As the number of outputs grows, so does the likelihood of unexpected issues during rendering.</p>
<hr>
<p>These challenges are not unique to any specific platform. They are inherent to projects that combine content, computation, and deployment into a single workflow.</p>
<p>And they tend to appear only after the initial setup phase—when the project starts to grow.</p>
</section>
</section>
<section id="lessons-learned" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="lessons-learned"><span class="header-section-number">5</span> Lessons Learned</h2>
<p>After going through this transition, it became clear that the real challenge is not learning a tool, but understanding the system behind it. What initially looked like a simple publishing workflow turned out to involve multiple layers—each with its own assumptions, constraints, and trade-offs.</p>
<p>Several key lessons emerged from this process.</p>
<section id="reproducibility-is-more-than-code" class="level3" data-number="5.1">
<h3 data-number="5.1" class="anchored" data-anchor-id="reproducibility-is-more-than-code"><span class="header-section-number">5.1</span> Reproducibility Is More Than Code</h3>
<p>It is easy to assume that a project is reproducible if the code runs successfully. In reality, reproducibility depends on much more than that.</p>
<p>It includes the execution environment, the dependencies, the system configuration, and even the availability of external resources.</p>
<blockquote class="blockquote">
<p>A project is reproducible only if its environment is reproducible.</p>
</blockquote>
<hr>
</section>
<section id="simplicity-improves-reliability" class="level3" data-number="5.2">
<h3 data-number="5.2" class="anchored" data-anchor-id="simplicity-improves-reliability"><span class="header-section-number">5.2</span> Simplicity Improves Reliability</h3>
<p>As a project grows, there is a natural tendency to add features, outputs, and integrations. However, every additional component increases the complexity of the pipeline. In practice, simpler workflows tend to be more robust and easier to maintain.</p>
<blockquote class="blockquote">
<p>The simpler the pipeline, the more reliable the deployment.</p>
</blockquote>
<hr>
</section>
<section id="external-dependencies-should-be-minimized" class="level3" data-number="5.3">
<h3 data-number="5.3" class="anchored" data-anchor-id="external-dependencies-should-be-minimized"><span class="header-section-number">5.3</span> External Dependencies Should Be Minimized</h3>
<p>External services, APIs, and remote data sources are powerful, but they introduce uncertainty. They depend on factors that are outside the control of the project:</p>
<ul>
<li>network conditions</li>
<li>service availability</li>
<li>response times</li>
</ul>
<p>Reducing reliance on external components—especially during deployment—can significantly improve stability.</p>
<hr>
</section>
<section id="local-does-not-equal-production" class="level3" data-number="5.4">
<h3 data-number="5.4" class="anchored" data-anchor-id="local-does-not-equal-production"><span class="header-section-number">5.4</span> Local Does Not Equal Production</h3>
<p>One of the most common misconceptions in development is assuming that success on a local machine guarantees success in production.</p>
<p>Different environments behave differently. What works in one context may fail in another without any changes in the code.</p>
<blockquote class="blockquote">
<p>If it works on your machine, it only proves that it works on your machine.</p>
</blockquote>
<hr>
</section>
<section id="build-time-is-a-signal" class="level3" data-number="5.5">
<h3 data-number="5.5" class="anchored" data-anchor-id="build-time-is-a-signal"><span class="header-section-number">5.5</span> Build Time Is a Signal</h3>
<p>Long build times are not just an inconvenience. They often indicate underlying issues:</p>
<ul>
<li>unnecessary computations</li>
<li>inefficient workflows</li>
<li>excessive dependencies</li>
</ul>
<p>Instead of treating build time as a secondary concern, it should be seen as a signal that something in the pipeline can be improved.</p>
<hr>
<p>Taken together, these lessons shift the perspective from “how to deploy a website” to a more meaningful question:</p>
<blockquote class="blockquote">
<p><em>How do I design a workflow that is stable, reproducible, and sustainable over time?</em></p>
</blockquote>
</section>
</section>
<section id="netlify-vs-github-pages" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="netlify-vs-github-pages"><span class="header-section-number">6</span> Netlify vs GitHub Pages</h2>
<p>After working with both platforms, the differences become clearer when viewed from a practical perspective rather than a purely technical one.</p>
<p>Both Netlify and GitHub Pages are capable solutions for publishing Quarto websites. However, they are built around different assumptions, and those assumptions become more visible as a project grows.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 27%">
<col style="width: 27%">
<col style="width: 44%">
</colgroup>
<thead>
<tr class="header">
<th>Feature</th>
<th>Netlify</th>
<th>GitHub Pages</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Initial setup</td>
<td>Very easy</td>
<td>Moderate</td>
</tr>
<tr class="even">
<td>Deployment model</td>
<td>Managed build service</td>
<td>Repository-driven workflow</td>
</tr>
<tr class="odd">
<td>Resource limits</td>
<td>Present (especially on free tiers)</td>
<td>Soft limits on site size and bandwidth, rarely reached in typical use</td>
</tr>
<tr class="even">
<td>Control over pipeline</td>
<td>Limited</td>
<td>High</td>
</tr>
<tr class="odd">
<td>Debugging visibility</td>
<td>Restricted</td>
<td>Detailed logs and transparency</td>
</tr>
<tr class="even">
<td>Suitability for data-driven projects</td>
<td>Limited</td>
<td>More flexible</td>
</tr>
</tbody>
</table>
<p>Netlify excels in simplicity. For lightweight static sites, documentation pages, or personal blogs with minimal computation, it provides a smooth and efficient experience. The setup is fast, and the platform handles most of the deployment process automatically.</p>
<p>GitHub Pages, on the other hand, offers greater control. While it may require more initial effort, it provides a clearer view of the build process and allows more flexibility in handling dependencies, workflows, and project structure.</p>
<p>The difference becomes especially important for Quarto projects that include code execution, data processing, or multiple outputs. In such cases, having visibility and control over the pipeline can make a significant difference in both stability and maintainability.</p>
<hr>
</section>
<section id="which-one-should-you-choose" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="which-one-should-you-choose"><span class="header-section-number">7</span> Which One Should You Choose?</h2>
<p>There is no single correct answer, but there is a practical way to think about the choice.</p>
<ul>
<li>If your project is a simple static blog with minimal computation, Netlify is often the most convenient option.</li>
<li>If your project involves data processing, code execution, or a more complex workflow, GitHub Pages tends to offer a more sustainable solution.</li>
</ul>
<p>Ultimately, the decision is less about the platform itself and more about the nature of the project.</p>
<hr>
</section>
<section id="final-thoughts" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="final-thoughts"><span class="header-section-number">8</span> Final Thoughts</h2>
<p>Publishing a Quarto blog is easy. Maintaining it as a real-world project is not. As soon as a project moves beyond a simple example, deployment becomes part of the system design. It requires thinking about environments, dependencies, workflows, and long-term sustainability. The tools themselves are not the challenge. The challenge is understanding how they interact. Once that becomes clear, the process becomes not only manageable, but also much more intentional. In that sense, deployment is no longer just a final step. It is part of the architecture.</p>


<!-- -->


</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-quartoWebsites" class="csl-entry">
Posit PBC. 2026a. <span>“Creating a Website.”</span> <a href="https://quarto.org/docs/websites/">https://quarto.org/docs/websites/</a>.
</div>
<div id="ref-quartoPublishing" class="csl-entry">
Posit PBC. 2026b. <span>“Publishing Basics.”</span> <a href="https://quarto.org/docs/publishing/">https://quarto.org/docs/publishing/</a>.
</div>
</div></section></div> ]]></description>
  <category>Quarto</category>
  <category>R</category>
  <category>GitHub Pages</category>
  <category>Netlify</category>
  <category>Data Science</category>
  <guid>https://mfatihtuzen.github.io/posts/2026-04-24_quarto_blog_github/</guid>
  <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Why Most Time Series Models Fail Before They Start</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/timeseries_stationary.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<section id="a-model-can-run-and-still-be-fundamentally-wrong" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="a-model-can-run-and-still-be-fundamentally-wrong"><span class="header-section-number">1</span> A model can run and still be fundamentally wrong</h2>
<p>Many time series models fail before they even begin. Not because the software crashes. Not because the code is wrong. But because the data entering the model violate one of the most important assumptions in time series analysis: <strong>stationarity</strong>.</p>
<p>This is where many analyses quietly go off the rails. A model is estimated, forecasts are produced, coefficients look serious, and the graphs appear convincing. But the model may be chasing a moving target rather than learning a stable data-generating mechanism.</p>
<p>In this post, we will work with a real macroeconomic series rather than a toy example. The data come from the <strong>Consumer Price Index for All Urban Consumers: All Items (CPIAUCSL)</strong>, published by the U.S. Bureau of Labor Statistics and distributed through FRED. FRED describes CPIAUCSL as a monthly, seasonally adjusted price index and notes that percent changes in the index are commonly used to measure inflation.</p>
<p>Because live API access may fail in some institutional or offline environments, this workflow uses a <strong>locally downloaded CSV file</strong> instead of fetching the series on the fly. You can download the file directly from the <a href="https://fred.stlouisfed.org/series/CPIAUCSL">CPIAUCSL page on FRED</a>.</p>
<p>The goal is simple: show why raw time series levels often mislead us, what stationarity really means, and why transformations such as differencing and log-differencing are not cosmetic tricks but conceptual necessities.</p>
</section>
<section id="what-stationarity-really-means" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="what-stationarity-really-means"><span class="header-section-number">2</span> What stationarity really means</h2>
<p>In informal language, a stationary series is one whose behavior does not drift in a systematic way over time. More formally, a weakly stationary process (<img src="https://latex.codecogs.com/png.latex?X_t">) satisfies three conditions:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AE(X_t)%20=%20%5Cmu%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0AVar(X_t)%20=%20%5Csigma%5E2%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0ACov(X_t,%20X_%7Bt-k%7D)%20=%20%5Cgamma_k%0A"></p>
<p>The first condition says the mean does not change over time. The second says the variance is constant. The third says the covariance between observations depends only on the lag <em>k</em>, not on calendar time itself.</p>
<p>This matters because a large part of classical time series modeling is built on the idea that the stochastic structure is stable. When that structure is drifting, many familiar tools become unreliable or at least much harder to interpret. A trending series can generate strong autocorrelation even when the underlying dynamic structure is weak. A persistent upward path can trick the analyst into seeing “model fit” where the model is merely inheriting inertia from the level of the series.</p>
<p>Put differently: without stationarity, a model may explain movement without actually explaining the mechanism.</p>
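<p>A small simulation makes the point concrete. The sketch below uses simulated data, not the CPI series: it builds a random walk from white-noise shocks and compares the lag-1 autocorrelation of the level with that of the first difference. The seed and sample size are arbitrary choices.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(42)              # arbitrary seed, only for reproducibility
n      &lt;- 500
shocks &lt;- rnorm(n)        # white noise: constant mean and variance
walk   &lt;- cumsum(shocks)  # random walk: X_t = X_{t-1} + e_t

# Differencing recovers the stationary shocks (up to floating-point rounding)
all.equal(diff(walk), shocks[-1])

# The level is dominated by inertia; the differenced series is not
acf(walk,       plot = FALSE)$acf[2]  # lag-1 autocorrelation, close to 1
acf(diff(walk), plot = FALSE)$acf[2]  # lag-1 autocorrelation, close to 0</code></pre></div>
</div>
<p>The shocks carry no dynamic structure at all, yet the level looks highly predictable from its own past. That is the inertia a model can mistake for fit.</p>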
</section>
<section id="load-the-cpi-data-from-a-csv-file" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="load-the-cpi-data-from-a-csv-file"><span class="header-section-number">3</span> Load the CPI data from a CSV file</h2>
<p>Download the CSV file for <strong>CPIAUCSL</strong> from the official FRED series page and save it in your working directory with the name <code>CPIAUCSL.csv</code>. The file typically includes the columns <code>observation_date</code> and <code>CPIAUCSL</code>. FRED is the distribution platform, while the source agency for the series is the U.S. Bureau of Labor Statistics.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(readr)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tibble)</span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(zoo)</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(scales)</span>
<span id="cb1-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(patchwork)</span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tseries)</span>
<span id="cb1-9"></span>
<span id="cb1-10">cpi_tbl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CPIAUCSL.csv"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show_col_types =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">transmute</span>(</span>
<span id="cb1-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.Date</span>(observation_date),</span>
<span id="cb1-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cpi  =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(CPIAUCSL)</span>
<span id="cb1-14">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(date) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(date), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(cpi))</span>
<span id="cb1-17"></span>
<span id="cb1-18">cpi_tbl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">slice_head</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 5 × 2
  date         cpi
  &lt;date&gt;     &lt;dbl&gt;
1 1947-01-01  21.5
2 1947-02-01  21.6
3 1947-03-01  22  
4 1947-04-01  22  
5 1947-05-01  22.0</code></pre>
</div>
</div>
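<p>The line <code>filter(!is.na(date), !is.na(cpi))</code> is important. If your CSV has an <code>NA</code> for a month such as October 2025, that observation is dropped explicitly instead of propagating into later calculations. Keep in mind that dropping a month leaves a gap in the series, so any difference computed across that gap spans two months rather than one.</p>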
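<p>A tiny made-up vector (not FRED data) shows what happens when a missing value is left in place: a single <code>NA</code> contaminates both growth rates that touch it.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">cpi_demo &lt;- c(100, 101, NA, 103)  # made-up values for illustration only
growth   &lt;- diff(log(cpi_demo))   # log-differences between neighbouring months

growth              # the second and third entries are NA
sum(is.na(growth))  # 2: one missing month removes two growth rates</code></pre></div>
</div>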
</section>
<section id="start-with-the-visual-story-not-the-test-statistic" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="start-with-the-visual-story-not-the-test-statistic"><span class="header-section-number">4</span> Start with the visual story, not the test statistic</h2>
<p>In time series analysis, the first serious diagnostic is often visual rather than formal. That is not because tests are unimportant. It is because plots let us see the basic character of the data before we start compressing everything into a p-value.</p>
<p>If a series has a visible trend, changing volatility, sudden level shifts, or unusual gaps, that already tells us something about whether a stationary model is likely to behave well.</p>
<section id="the-raw-cpi-level" class="level3" data-number="4.1">
<h3 data-number="4.1" class="anchored" data-anchor-id="the-raw-cpi-level"><span class="header-section-number">4.1</span> The raw CPI level</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">p_level <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(cpi_tbl, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> date, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> cpi)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#1B4965"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb3-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"U.S. CPI (CPIAUCSL): level series"</span>,</span>
<span id="cb3-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Monthly, seasonally adjusted index from FRED"</span>,</span>
<span id="cb3-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb3-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Index"</span></span>
<span id="cb3-8">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">label_number</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb3-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb3-11"></span>
<span id="cb3-12">p_level</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/index_files/figure-html/unnamed-chunk-2-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Even before applying a formal statistical test, the visual pattern already tells us something important. The CPI level series does not oscillate around a stable mean; instead, it follows a persistent upward path over time. This alone raises an immediate warning against modeling the raw level series as if it were stationary.</p>
<p>The graph also suggests that the increase is not perfectly uniform across the entire sample. In some periods, the slope becomes steeper, indicating faster price growth, while in others the series evolves more gradually. In other words, the series appears to contain not only a long-run trend but also changes in inflation dynamics over time.</p>
<p>This is precisely why visual inspection should be the first step in time series analysis. Before looking at test statistics or fitting a model, we should ask a simpler question: does the series <em>look</em> like it fluctuates around a constant level? In this case, the answer is clearly no.</p>
<p>A smooth and steadily rising curve may look statistically innocent at first glance, but in practice it is often a sign that the raw series is carrying trend information that must be addressed before modeling.</p>
</section>
<section id="rolling-summaries-to-deepen-the-visual-diagnosis" class="level3" data-number="4.2">
<h3 data-number="4.2" class="anchored" data-anchor-id="rolling-summaries-to-deepen-the-visual-diagnosis"><span class="header-section-number">4.2</span> Rolling summaries to deepen the visual diagnosis</h3>
<p>A single line plot is useful, but local summaries make the visual argument sharper. Below, I compute a 24-month rolling mean and rolling standard deviation.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">cpi_roll <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> cpi_tbl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb4-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">roll_mean_24 =</span> zoo<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rollmean</span>(cpi, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">k =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">align =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"right"</span>),</span>
<span id="cb4-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">roll_sd_24   =</span> zoo<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rollapply</span>(cpi, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">width =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">FUN =</span> sd, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">align =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"right"</span>)</span>
<span id="cb4-5">  )</span>
<span id="cb4-6"></span>
<span id="cb4-7">p_roll_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(cpi_roll, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(date, roll_mean_24)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#2A9D8F"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"24-month rolling mean of CPI"</span>,</span>
<span id="cb4-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb4-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Rolling mean"</span></span>
<span id="cb4-13">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb4-15"></span>
<span id="cb4-16">p_roll_sd <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(cpi_roll, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(date, roll_sd_24)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#E76F51"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-18">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-19">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"24-month rolling standard deviation of CPI"</span>,</span>
<span id="cb4-20">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb4-21">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Rolling SD"</span></span>
<span id="cb4-22">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-23">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb4-24"></span>
<span id="cb4-25">p_roll_mean <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> p_roll_sd</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/index_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>If the series were approximately stationary, we would expect these rolling statistics to fluctuate around relatively stable levels over time. In particular, the rolling mean should remain close to a constant value, and the rolling standard deviation should not exhibit systematic shifts.</p>
<p>However, the evidence here points in the opposite direction. The rolling mean shows a clear and persistent upward drift, reinforcing what we observed in the raw series: the central tendency is not stable, but evolving over time.</p>
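<p>The rolling standard deviation tells a more nuanced story. While it remains relatively moderate for long periods, there are noticeable fluctuations and, more importantly, a pronounced spike in recent years. Because the level itself is trending, the standard deviation within a 24-month window partly reflects how steeply the index rose during that window, so the spike signals a period of unusually fast price growth as much as a change in short-run volatility. Either way, the variability of the series is not constant and appears to respond to underlying economic conditions or shocks.</p>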
<p>Taken together, these two plots suggest that the series violates the key requirements of stationarity in terms of both mean and variance. While rolling statistics alone do not formally prove non-stationarity, they provide strong visual evidence that the raw series is not suitable for direct modeling without transformation.</p>
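<p>To make the benchmark concrete, here is a small sketch on simulated data rather than CPI. It compares the rolling mean of a stationary white-noise series with that of a random walk, using the same 24-period trailing window. The series names, the seed, and the sample size are arbitrary choices for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(zoo)

set.seed(123)  # arbitrary seed, for reproducibility only
n &lt;- 600

# Simulated series (not CPI): stationary white noise vs. a random walk
white_noise &lt;- rnorm(n)
random_walk &lt;- cumsum(rnorm(n))

# Same 24-period trailing window as in the CPI example
roll_wn &lt;- zoo::rollmean(white_noise, k = 24, fill = NA, align = "right")
roll_rw &lt;- zoo::rollmean(random_walk, k = 24, fill = NA, align = "right")

# Spread of each rolling mean over the sample
diff(range(roll_wn, na.rm = TRUE))
diff(range(roll_rw, na.rm = TRUE))</code></pre></div>
</div>
<p>In a run like this, the rolling mean of the white noise should stay in a narrow band around zero, while the rolling mean of the random walk wanders over a much wider range. The CPI level behaves like the second case, not the first.</p>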
</section>
</section>
<section id="why-raw-cpi-levels-are-a-good-example" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="why-raw-cpi-levels-are-a-good-example"><span class="header-section-number">5</span> Why raw CPI levels are a good example</h2>
<p>CPI is ideal for illustrating this problem because the level series typically trends upward over time. That is not a defect in the data; it is what a price index often does. But from a modeling perspective, it creates trouble.</p>
<p>If the level keeps drifting upward, then the mean is not constant. If the size of movements changes as the level rises, the variance may also appear unstable. In such a setting, fitting a model directly to the raw series can mix long-run inflationary drift with short-run dynamic behavior.</p>
<p>Economically, analysts are usually not interested in the index level itself as much as they are interested in <strong>inflation</strong>, that is, the rate at which the price level changes. Statistically, this is convenient too, because transforming the series from levels to changes often brings it closer to stationarity.</p>
</section>
<section id="a-statistical-check-the-augmented-dickey-fuller-test" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="a-statistical-check-the-augmented-dickey-fuller-test"><span class="header-section-number">6</span> A statistical check: the Augmented Dickey-Fuller test</h2>
<p>Visual diagnosis matters, but it is usually not enough. A commonly used statistical tool is the <strong>Augmented Dickey-Fuller (ADF) test</strong>, which tests for the presence of a unit root. In practical terms, the test is often used to assess whether a series behaves like a non-stationary process with a persistent stochastic trend.</p>
<p>The null hypothesis of the ADF test is that the series has a unit root. That means the burden of proof is asymmetric:</p>
<ul>
<li>a <strong>large</strong> p-value means we do <strong>not</strong> have strong evidence against non-stationarity,</li>
<li>a <strong>small</strong> p-value means the data are more consistent with stationarity.</li>
</ul>
<p>That distinction is easy to say and easy to misuse. Failing to reject the null is not the same thing as proving a series is non-stationary beyond all doubt. It simply means the test did not find enough evidence against the unit-root view.</p>
<p>Let us start with the raw CPI level.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">adf_level <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> tseries<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">adf.test</span>(cpi_tbl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>cpi)</span>
<span id="cb5-2">adf_level</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
    Augmented Dickey-Fuller Test

data:  cpi_tbl$cpi
Dickey-Fuller = -0.1813, Lag order = 9, p-value = 0.99
alternative hypothesis: stationary</code></pre>
</div>
</div>
<p>As noted above, the null hypothesis is that the series has a unit root, and the alternative is stationarity. Two details of <code>tseries::adf.test()</code> are worth knowing when reading this output: the test regression includes a constant and a linear trend, and the default lag order is <code>trunc((n - 1)^(1/3))</code>, which is where the lag order of 9 comes from.</p>
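<p>In this case, the reported p-value is 0.99, so we fail to reject the null hypothesis. Because <code>adf.test()</code> interpolates p-values from a table of critical values, 0.99 is the largest value it reports, and the true p-value may be higher still. In other words, the test provides no evidence that the CPI level series is stationary.</p>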
<p>However, this result should not be interpreted in isolation. Statistical tests and visual diagnostics should complement each other. The high p-value is entirely consistent with what we observed earlier: the series exhibits a strong upward trend and does not fluctuate around a constant mean.</p>
<p>Taken together, the visual evidence and the ADF test point to the same conclusion: the raw CPI level behaves more like a drifting (unit root) process than a stationary one. This reinforces the need to transform the series before attempting any meaningful modeling.</p>
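<p>It can help to see how the test behaves when the answer is known. The sketch below applies the same function to two simulated series (not CPI): a stationary AR(1) process and a random walk. The coefficient, seed, and sample size are arbitrary assumptions chosen for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(tseries)

set.seed(42)  # arbitrary seed, for reproducibility only
n &lt;- 500

# Simulated series (not CPI)
ar1_series  &lt;- as.numeric(arima.sim(model = list(ar = 0.5), n = n))  # stationary AR(1)
random_walk &lt;- cumsum(rnorm(n))                                      # unit-root process

# adf.test() warns when the p-value falls outside its table range,
# so the warnings are suppressed here to keep the output compact
adf_ar1 &lt;- suppressWarnings(tseries::adf.test(ar1_series))
adf_rw  &lt;- suppressWarnings(tseries::adf.test(random_walk))

# Compare the two p-values side by side
c(ar1 = adf_ar1$p.value, random_walk = adf_rw$p.value)</code></pre></div>
</div>
<p>We would typically expect a small p-value for the AR(1) series and a large one for the random walk. Any single simulated random walk can still reject by chance, which is one more reason to read the test alongside the plots rather than in place of them.</p>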
</section>
<section id="the-first-rescue-differencing" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="the-first-rescue-differencing"><span class="header-section-number">7</span> The first rescue: differencing</h2>
<p>One of the oldest and most important ideas in time series analysis is that differencing can remove certain forms of trend. The first difference is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20X_t%20=%20X_t%20-%20X_%7Bt-1%7D%0A"></p>
<p>This transformation asks a different question. Instead of modeling the level, we model the change from one period to the next.</p>
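<p>A tiny simulated example shows why this works for a unit-root process. A random walk is the running sum of its shocks, so taking first differences hands the shocks back. The series below is simulated and is not CPI; the seed and length are arbitrary.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)  # arbitrary seed, for reproducibility only
shocks &lt;- rnorm(200)

# A random walk is the cumulative sum of its shocks
x &lt;- cumsum(shocks)

# First difference: X_t - X_{t-1}
dx &lt;- diff(x)

length(x) - length(dx)      # differencing loses one observation
all.equal(dx, shocks[-1])   # the differences recover the original shocks</code></pre></div>
</div>
<p>Because <code>diff()</code> returns one value fewer than its input, the CPI code pads the result with a leading <code>NA</code> so that it fits back into the data frame inside <code>mutate()</code>.</p>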
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">cpi_diff_tbl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> cpi_tbl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">diff_cpi =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diff</span>(cpi))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(diff_cpi))</span>
<span id="cb7-4"></span>
<span id="cb7-5">p_diff <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(cpi_diff_tbl, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> date, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> diff_cpi)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#6D597A"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb7-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"First difference of CPI"</span>,</span>
<span id="cb7-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Absolute month-to-month change in the index"</span>,</span>
<span id="cb7-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb7-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">expression</span>(Delta<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>CPI)</span>
<span id="cb7-12">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb7-14"></span>
<span id="cb7-15">p_diff</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/index_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Taking the first difference removes a large part of the visible trend in the series. Compared to the raw CPI level, the differenced series fluctuates around a much more stable center, which is an encouraging sign from a modeling perspective.</p>
<p>However, differencing does not fully solve the problem. While it helps stabilize the mean, the variability of the series still appears to change over time, particularly in more recent periods where larger fluctuations are observed. This suggests that the series may still violate the constant variance assumption.</p>
<p>There is also a more subtle but important issue: interpretation. The first difference represents absolute changes in the index, not relative ones. In macroeconomic data, a one-point increase in CPI does not carry the same meaning when the index is around 100 versus when it exceeds 300. As the scale of the series grows, the same absolute change reflects a smaller proportional movement.</p>
<p>In other words, differencing improves the statistical properties of the series, but it does not yet provide a fully consistent or interpretable measure of change. This is why we often go one step further and consider transformations based on relative (percentage) changes.</p>
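<p>The scale issue is easy to verify with a little arithmetic. The sketch below uses two round index levels, chosen for illustration and not actual CPI observations, and computes the percentage change implied by the same one-point increase at each level.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Illustrative index levels (round numbers, not actual CPI observations)
level &lt;- c(100, 300)
abs_change &lt;- 1  # the same one-point increase at both levels

# Percentage change implied by a one-point move at each level
pct_change &lt;- 100 * abs_change / level
round(pct_change, 2)</code></pre></div>
</div>
<p>A one-point move is a 1 percent change at a level of 100 but only about a third of a percent at a level of 300, so the same value of the first difference means different things in different parts of the sample.</p>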
</section>
<section id="the-more-meaningful-rescue-log-differences" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="the-more-meaningful-rescue-log-differences"><span class="header-section-number">8</span> The more meaningful rescue: log differences</h2>
<p>This is where the log transformation becomes more than a technical detail. Consider</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta%20%5Clog(X_t)%20=%20%5Clog(X_t)%20-%20%5Clog(X_%7Bt-1%7D)%0A"></p>
<p>For small changes, this is approximately the proportional growth rate, because <code>log(1 + r)</code> is close to <code>r</code> when <code>r</code> is near zero. In the CPI context, it moves us from the language of index levels toward the language of inflation.</p>
<p>That shift is both statistical and economic.</p>
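<p>How good is the approximation? The sketch below compares the log difference with the exact proportional change for three hypothetical growth rates, from small to large. The rates and the starting level of 100 are assumptions for illustration only.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical month-to-month growth rates: small, moderate, large
growth &lt;- c(0.002, 0.02, 0.20)

# Two consecutive index values implied by each growth rate
x_prev &lt;- 100
x_next &lt;- x_prev * (1 + growth)

log_diff &lt;- log(x_next) - log(x_prev)

# Gap between the exact rate and its log-difference approximation
data.frame(growth, log_diff, gap = growth - log_diff)

# Annualized percent: 12 months times 100 percent, as in 1200 * dlog_cpi
1200 * log_diff</code></pre></div>
</div>
<p>The log difference always sits slightly below the exact rate, and the gap widens as the change gets larger. For the month-to-month CPI movements considered here, which are typically a fraction of a percent, the two are nearly indistinguishable.</p>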
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">cpi_log_tbl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> cpi_tbl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb8-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">log_cpi =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log</span>(cpi),</span>
<span id="cb8-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dlog_cpi =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">diff</span>(log_cpi)),</span>
<span id="cb8-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">annualized_inflation_pct =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1200</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> dlog_cpi,</span>
<span id="cb8-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">yoy_inflation_pct =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (cpi <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lag</span>(cpi, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-7">  )</span>
<span id="cb8-8"></span>
<span id="cb8-9">p_dlog <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> cpi_log_tbl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(annualized_inflation_pct)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> date, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> annualized_inflation_pct)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#D62828"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb8-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Monthly log-difference of CPI (annualized)"</span>,</span>
<span id="cb8-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A close cousin of short-run inflation"</span>,</span>
<span id="cb8-16">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb8-17">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Percent"</span></span>
<span id="cb8-18">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-19">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb8-20"></span>
<span id="cb8-21">p_yoy <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> cpi_log_tbl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(yoy_inflation_pct)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-23">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> date, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> yoy_inflation_pct)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-24">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#F4A261"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-25">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb8-26">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Year-over-year CPI inflation"</span>,</span>
<span id="cb8-27">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A slower-moving inflation measure"</span>,</span>
<span id="cb8-28">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb8-29">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Percent"</span></span>
<span id="cb8-30">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-31">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb8-32"></span>
<span id="cb8-33">p_dlog <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> p_yoy</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/index_files/figure-html/unnamed-chunk-6-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Two key insights emerge from these transformations.</p>
<p>First, moving from levels to rates of change fundamentally improves interpretability. The log-difference series represents approximate percentage changes — in this context, a close proxy for short-run inflation. This is the quantity economists actually care about. A 1% increase has the same meaning regardless of whether the index is at 100 or 300, making comparisons over time much more meaningful.</p>
<p>Second, the transformation has a clear impact on the statistical properties of the series. Compared to the raw level and even the first difference, the log-differenced series fluctuates more consistently around a stable mean. While it still exhibits volatility spikes and occasional outliers, the overall behavior is much closer to what we would expect from a stationary process.</p>
<p>The comparison between the two plots is also instructive. The monthly log-difference captures short-term fluctuations and reacts quickly to shocks, while the year-over-year inflation series smooths out this noise and highlights longer-term inflation dynamics. Both are useful, but they answer different questions.</p>
<p>To put it bluntly: you did not just transform the data — you changed the question.</p>
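<p>The scale-invariance point above can be checked with a few lines of base R. This is a minimal sketch using made-up index values (100 to 101 and 300 to 303), not the CPI data from this post: the absolute change differs by a factor of three, while the log difference is identical.</p>

```r
# Hypothetical index values: the same 1% rise at two different index levels
low  <- c(100, 101)
high <- c(300, 303)

# Absolute changes depend on the level of the index
abs_change_low  <- diff(low)
abs_change_high <- diff(high)

# Log differences depend only on the ratio between consecutive values
log_change_low  <- diff(log(low))
log_change_high <- diff(log(high))

# Approximate percent change implied by each log difference
round(100 * c(low = log_change_low, high = log_change_high), 3)
```

<p>Both log differences come out just under 1 percent, which is why the log-differenced series can be compared across decades even though the index level has changed substantially.</p>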
</section>
<section id="re-test-after-transformation" class="level2" data-number="9">
<h2 data-number="9" class="anchored" data-anchor-id="re-test-after-transformation"><span class="header-section-number">9</span> Re-test after transformation</h2>
<p>Let us apply the ADF test again, this time to the log-differenced series.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">adf_dlog <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> cpi_log_tbl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(dlog_cpi)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(dlog_cpi) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-4">  tseries<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">adf.test</span>()</span>
<span id="cb9-5"></span>
<span id="cb9-6">adf_dlog</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
    Augmented Dickey-Fuller Test

data:  .
Dickey-Fuller = -4.3862, Lag order = 9, p-value = 0.01
alternative hypothesis: stationary</code></pre>
</div>
</div>
<p>The contrast between the two ADF test results is striking and highly informative.</p>
<p>For the raw CPI level, we failed to reject the null hypothesis of a unit root, indicating that the series behaves as a non-stationary process. In contrast, for the log-differenced series, the reported p-value is 0.01, which is the smallest value <code>tseries::adf.test()</code> prints, so the true p-value is at or below that level. This allows us to reject the null hypothesis at conventional significance levels and conclude that the transformed series is consistent with stationarity.</p>
<p>This shift is not just a technical detail — it reflects a fundamental change in how the data behaves. The transformation has effectively removed the persistent trend component and brought the series closer to a stable statistical structure.</p>
<p>That said, the test result should always be interpreted alongside the visual evidence. The ADF test provides formal confirmation, but the intuition comes from the plots. What we saw visually — a drifting level series versus a mean-reverting transformed series — is now supported by statistical testing.</p>
<p>In essence, the workflow comes full circle:<br>
we start with a problematic series, diagnose the issue visually, apply a transformation, and then verify the improvement formally.</p>
<p>This is the core of time series thinking.</p>
</section>
<section id="a-subtle-but-crucial-point-transformation-changes-interpretation" class="level2" data-number="10">
<h2 data-number="10" class="anchored" data-anchor-id="a-subtle-but-crucial-point-transformation-changes-interpretation"><span class="header-section-number">10</span> A subtle but crucial point: transformation changes interpretation</h2>
<p>This is the point where many explanations remain superficial.</p>
<p>When you difference a series, you are not merely “cleaning” it — you are redefining the object of analysis.</p>
<ul>
<li>Modeling <strong>CPI levels</strong> asks how the price index evolves over time.</li>
<li>Modeling <strong>first differences</strong> asks how much the index changes from one period to the next.</li>
<li>Modeling <strong>log differences</strong> asks about proportional change, which is directly linked to inflation.</li>
</ul>
<p>These are not equivalent statistical questions, and they are certainly not equivalent economic questions.</p>
<p>This is why time series preprocessing should never be treated as a mechanical step. Every transformation involves a trade-off: it improves certain statistical properties while simultaneously altering the meaning of the data.</p>
<p>Understanding that trade-off is not optional — it is central to sound time series analysis.</p>
</section>
<section id="why-this-matters-for-arima-style-modeling" class="level2" data-number="11">
<h2 data-number="11" class="anchored" data-anchor-id="why-this-matters-for-arima-style-modeling"><span class="header-section-number">11</span> Why this matters for ARIMA-style modeling</h2>
<p>ARIMA models are often presented as if the workflow were mechanical: inspect the series, difference if needed, identify orders, estimate parameters, check residuals, and forecast. While this workflow is useful, it can create the misleading impression that differencing is simply a procedural step — a box to tick.</p>
<p>It is not.</p>
<p>Differencing is a deliberate modeling choice. Its purpose is to separate persistent, trend-like behavior from shorter-run dynamics. If you skip it when it is needed, your model may inherit non-stationarity and produce unreliable or misleading inference. If you apply it excessively, you risk removing meaningful structure and end up modeling noise.</p>
<p>The real question, therefore, is not “Should I difference?” but rather:<br>
<strong>What feature of the data am I trying to stabilize, and what question do I want the model to answer?</strong></p>
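<p>The risk of over-differencing mentioned above can be illustrated with a small simulation. The sketch below assumes a simulated white-noise series (already stationary, not CPI data) and an arbitrary seed; it compares the lag-1 autocorrelation before and after an unnecessary difference.</p>

```r
set.seed(123)                       # arbitrary seed for reproducibility
e <- rnorm(5000)                    # white noise: already stationary
over_diff <- diff(e)                # differencing a series that did not need it

# Lag-1 autocorrelation before and after differencing
acf_raw  <- acf(e, lag.max = 1, plot = FALSE)$acf[2]
acf_diff <- acf(over_diff, lag.max = 1, plot = FALSE)$acf[2]

round(c(raw = acf_raw, differenced = acf_diff), 2)
```

<p>For white noise, the theoretical lag-1 autocorrelation of the differenced series is -0.5, so the differencing step has introduced dependence that was not in the original data. A model fitted to the over-differenced series would spend part of its structure describing that artifact.</p>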
</section>
<section id="a-compact-comparison" class="level2" data-number="12">
<h2 data-number="12" class="anchored" data-anchor-id="a-compact-comparison"><span class="header-section-number">12</span> A compact comparison</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 23%">
<col style="width: 23%">
<col style="width: 23%">
<col style="width: 28%">
</colgroup>
<thead>
<tr class="header">
<th>Series version</th>
<th>What it represents</th>
<th>Typical issue</th>
<th>When it helps (and when it does not)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>CPI level</td>
<td>The price index itself</td>
<td>Strong trend, likely unit root</td>
<td>Poor starting point for stationary modeling</td>
</tr>
<tr class="even">
<td>First difference</td>
<td>Absolute period-to-period change</td>
<td>Still scale-dependent</td>
<td>Reduces trend, but interpretation remains limited</td>
</tr>
<tr class="odd">
<td>Log difference</td>
<td>Approximate proportional change</td>
<td>May still show volatility bursts</td>
<td>More suitable for modeling inflation-type dynamics</td>
</tr>
<tr class="even">
<td>Year-over-year change</td>
<td>Annual percentage change</td>
<td>Smoother, less responsive</td>
<td>Useful for communication, less suited for short-run analysis</td>
</tr>
</tbody>
</table>
</section>
<section id="common-mistakes" class="level2" data-number="13">
<h2 data-number="13" class="anchored" data-anchor-id="common-mistakes"><span class="header-section-number">13</span> Common mistakes</h2>
<p>Most mistakes in time series analysis are not computational — they are conceptual.</p>
<p><strong>Mistake 1: fitting models directly to raw levels because the plot “looks smooth.”</strong><br>
Smoothness is not stationarity. A strong trend can produce visually smooth series that are statistically problematic.</p>
<p><strong>Mistake 2: treating differencing as a harmless default.</strong><br>
Differencing changes the meaning of the data. It may improve statistical properties while quietly reducing interpretability if applied without care.</p>
<p><strong>Mistake 3: relying on a single test result as final truth.</strong><br>
The ADF test is useful, but it is only one piece of evidence. Visual inspection, domain knowledge, structural breaks, and alternative tests all matter.</p>
<p><strong>Mistake 4: forgetting the economics.</strong><br>
In the case of CPI, the focus is typically on inflation, not the index level itself. A good transformation is one that improves statistical validity while remaining aligned with the economic question.</p>
<p>Taken together, these mistakes point to a simple lesson:<br>
<strong>time series analysis is not about applying steps — it is about making informed choices.</strong></p>
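<p>As one way to act on Mistake 3, the ADF test can be paired with a test whose null hypothesis points the other way. The sketch below assumes the <code>tseries</code> package (the same one used for <code>adf.test()</code> in this post) and a simulated random walk rather than the CPI series; the seed and sample size are arbitrary.</p>

```r
library(tseries)                       # provides adf.test() and kpss.test()

set.seed(42)                           # arbitrary seed
random_walk <- cumsum(rnorm(500))      # simulated unit-root series (not CPI data)
increments  <- diff(random_walk)       # its first difference

# ADF: the null hypothesis is a unit root
adf_level <- suppressWarnings(adf.test(random_walk))
adf_diff  <- suppressWarnings(adf.test(increments))

# KPSS: the null hypothesis is level stationarity, so the logic is reversed
kpss_level <- suppressWarnings(kpss.test(random_walk, null = "Level"))
kpss_diff  <- suppressWarnings(kpss.test(increments, null = "Level"))

# Collect the reported p-values side by side
data.frame(
  series = c("level", "difference"),
  adf_p  = c(adf_level$p.value, adf_diff$p.value),
  kpss_p = c(kpss_level$p.value, kpss_diff$p.value)
)
```

<p>The calls are wrapped in <code>suppressWarnings()</code> because both functions warn when the p-value falls outside the range of their lookup tables. When the two tests agree (ADF rejects and KPSS does not, or the reverse), the evidence is easier to read; when they disagree, that is a signal to look more closely at breaks, trends, or sample size.</p>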
</section>
<section id="final-thoughts" class="level2" data-number="14">
<h2 data-number="14" class="anchored" data-anchor-id="final-thoughts"><span class="header-section-number">14</span> Final thoughts</h2>
<p>Most time series models do not fail because we cannot estimate them. They fail because we model the wrong object.</p>
<p>The raw CPI series is a clear reminder that not every observed series is ready for modeling. A trending index is rarely an appropriate input for a stationary model. Once we difference — and especially log-difference — the data, the series becomes more interpretable, more stable, and much closer to the type of process that classical time series methods are designed to handle.</p>
<p>So before asking whether your model is sophisticated enough, ask a more fundamental question:</p>
<p><strong>Am I modeling a stable process — or just chasing drift?</strong></p>
<p>In many cases, the answer to this question matters far more than whether you choose AR(1), ARIMA(1,1,1), or any other fashionable specification.</p>
</section>
<section id="references-and-further-reading" class="level2" data-number="15">
<h2 data-number="15" class="anchored" data-anchor-id="references-and-further-reading"><span class="header-section-number">15</span> References and further reading</h2>
<section id="data-sources" class="level3" data-number="15.1">
<h3 data-number="15.1" class="anchored" data-anchor-id="data-sources"><span class="header-section-number">15.1</span> Data sources</h3>
<ul>
<li><p>FRED, Federal Reserve Bank of St.&nbsp;Louis. <em>Consumer Price Index for All Urban Consumers: All Items (CPIAUCSL).</em><br>
<a href="https://fred.stlouisfed.org/series/CPIAUCSL" class="uri">https://fred.stlouisfed.org/series/CPIAUCSL</a></p></li>
<li><p>FRED API documentation. <em>St.&nbsp;Louis Fed Web Services: FRED® API.</em><br>
<a href="https://fred.stlouisfed.org/docs/api/fred/" class="uri">https://fred.stlouisfed.org/docs/api/fred/</a></p></li>
</ul>
<hr>
</section>
<section id="core-time-series-references" class="level3" data-number="15.2">
<h3 data-number="15.2" class="anchored" data-anchor-id="core-time-series-references"><span class="header-section-number">15.2</span> Core time series references</h3>
<ul>
<li><p>Box, G. E. P., Jenkins, G. M., Reinsel, G. C., &amp; Ljung, G. M. (2015). <em>Time Series Analysis: Forecasting and Control.</em> Wiley.</p></li>
<li><p>Hyndman, R. J., &amp; Athanasopoulos, G. (2021). <em>Forecasting: Principles and Practice (3rd ed.).</em><br>
<a href="https://otexts.com/fpp3/" class="uri">https://otexts.com/fpp3/</a></p></li>
<li><p>Hamilton, J. D. (1994). <em>Time Series Analysis.</em> Princeton University Press.</p></li>
</ul>
<hr>
</section>
<section id="stationarity-and-unit-root-testing" class="level3" data-number="15.3">
<h3 data-number="15.3" class="anchored" data-anchor-id="stationarity-and-unit-root-testing"><span class="header-section-number">15.3</span> Stationarity and unit root testing</h3>
<ul>
<li><p>Dickey, D. A., &amp; Fuller, W. A. (1979). <em>Distribution of the estimators for autoregressive time series with a unit root.</em> Journal of the American Statistical Association.</p></li>
<li><p>Said, S. E., &amp; Dickey, D. A. (1984). <em>Testing for unit roots in autoregressive-moving average models of unknown order.</em> Biometrika.</p></li>
</ul>
<hr>
</section>
<section id="transformations-and-interpretation" class="level3" data-number="15.4">
<h3 data-number="15.4" class="anchored" data-anchor-id="transformations-and-interpretation"><span class="header-section-number">15.4</span> Transformations and interpretation</h3>
<ul>
<li><p>Stock, J. H., &amp; Watson, M. W. (2019). <em>Introduction to Econometrics.</em> Pearson.</p></li>
<li><p>Tsay, R. S. (2010). <em>Analysis of Financial Time Series.</em> Wiley.</p></li>
</ul>
<hr>
</section>
<section id="practical-r-resources" class="level3" data-number="15.5">
<h3 data-number="15.5" class="anchored" data-anchor-id="practical-r-resources"><span class="header-section-number">15.5</span> Practical R resources</h3>
<ul>
<li><p>R Core Team. <em>R: A Language and Environment for Statistical Computing.</em><br>
<a href="https://www.r-project.org/" class="uri">https://www.r-project.org/</a></p></li>
<li><p>Hyndman, R. J. et al.&nbsp;<em>forecast package documentation.</em><br>
<a href="https://pkg.robjhyndman.com/forecast/" class="uri">https://pkg.robjhyndman.com/forecast/</a></p></li>
</ul>
<hr>
</section>
<section id="suggested-next-steps-for-readers" class="level3" data-number="15.6">
<h3 data-number="15.6" class="anchored" data-anchor-id="suggested-next-steps-for-readers"><span class="header-section-number">15.6</span> Suggested next steps for readers</h3>
<p>If you want to go deeper, consider exploring:</p>
<ul>
<li>Unit root tests beyond ADF (KPSS, Phillips–Perron)</li>
<li>Structural breaks and regime changes</li>
<li>Seasonal differencing and SARIMA models</li>
<li>Volatility modeling (ARCH/GARCH)</li>
</ul>
<p>These topics build directly on the ideas discussed in this article and will deepen your understanding of time series behavior.</p>



</section>
</section>

 ]]></description>
  <category>Time Series</category>
  <category>Statistical Thinking</category>
  <category>R Programming</category>
  <guid>https://mfatihtuzen.github.io/posts/2026-04-16_timeseries_stationary/</guid>
  <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Data Leakage in R: Why Correct Evaluation Matters Even When Metrics Do Not Change</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2026-01-22_data_leakage/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-01-22_data_leakage/dataleakage.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="504"></p>
</figure>
</div>
<section id="introduction-why-this-topic-matters" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction-why-this-topic-matters"><span class="header-section-number">1</span> Introduction – Why This Topic Matters</h2>
<p>A model that performs exceptionally well on a test set is not necessarily a good model; in many cases, it is a warning sign. High accuracy or low error metrics are meaningful only if we understand <strong>how</strong> they were obtained. In real-world settings, models rarely encounter data generated under the same conditions as the training phase: data arrive sequentially, delays occur, missingness patterns change, and measurement errors accumulate. Under such conditions, impressive validation metrics can quickly lose their relevance.</p>
<p>A common scenario in applied data science is deceptively familiar. During development, the model looks flawless: cross-validation results are stable, performance metrics are strong, and diagnostic plots inspire confidence. Once deployed, however, performance deteriorates—sometimes rapidly. Forecasts drift, classification decisions become unreliable, and stakeholders begin to question the entire modeling pipeline. While this failure is often attributed to distributional shift or concept drift, a more fundamental issue is frequently overlooked: <strong>the model was exposed, directly or indirectly, to information it would not have access to at prediction time</strong>.</p>
<p>This phenomenon is known as <em>data leakage</em>. Importantly, data leakage is rarely the result of an obvious coding mistake. More often, it emerges from subtle flaws in experimental design, preprocessing order, or feature construction decisions made well before the model is fitted. As a result, leakage can silently inflate performance metrics, creating models that appear robust on paper but collapse in practice.</p>
<blockquote>
<p><em>“A model that performs perfectly on paper but fails miserably in practice is often a victim of data leakage.”</em></p>
</blockquote>
<p>In this article, we examine data leakage not as a technical curiosity, but as a structural threat to valid statistical modeling. We begin by clarifying what data leakage is—and what it is not—before demonstrating, using a real dataset and R-based workflows, how seemingly reasonable preprocessing choices can contaminate model evaluation. We then reconstruct the same analysis using a leakage-free pipeline, highlighting the practical and conceptual differences through numerical results and carefully designed visualizations.</p>
</section>
<section id="what-is-data-leakage" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="what-is-data-leakage"><span class="header-section-number">2</span> What Is Data Leakage?</h2>
<p>At its core, <em>data leakage</em> occurs when information that would not be available at prediction time is inadvertently used during model training or evaluation. This information can enter the modeling pipeline in subtle ways—often long before a model is fitted—leading to overly optimistic performance estimates. The critical issue is not that the model “cheats,” but that the <strong>experimental setup allows future or target-related information to influence learning</strong>.</p>
<p>Formally, consider a supervised learning problem where we aim to estimate a function:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Af%20:%20%5Cmathcal%7BX%7D%20%5Crightarrow%20%5Cmathcal%7BY%7D%0A"></p>
<p>using a training set <img src="https://latex.codecogs.com/png.latex?(X_%7B%5Ctext%7Btrain%7D%7D,%20y_%7B%5Ctext%7Btrain%7D%7D)"> and evaluate it on a test set <img src="https://latex.codecogs.com/png.latex?(X_%7B%5Ctext%7Btest%7D%7D,%20y_%7B%5Ctext%7Btest%7D%7D)">. A valid evaluation assumes that <img src="https://latex.codecogs.com/png.latex?X_%7B%5Ctext%7Btest%7D%7D"> is generated independently of <img src="https://latex.codecogs.com/png.latex?y_%7B%5Ctext%7Btrain%7D%7D"> and that no function of <img src="https://latex.codecogs.com/png.latex?X_%7B%5Ctext%7Btest%7D%7D"> or <img src="https://latex.codecogs.com/png.latex?y_%7B%5Ctext%7Btest%7D%7D"> influences the training process. Data leakage violates this assumption by introducing a dependency, direct or indirect, between training and test information.</p>
<section id="what-data-leakage-is-not" class="level3" data-number="2.1">
<h3 data-number="2.1" class="anchored" data-anchor-id="what-data-leakage-is-not"><span class="header-section-number">2.1</span> What Data Leakage Is <em>Not</em></h3>
<p>Data leakage is often confused with other, related modeling issues. Clarifying these distinctions is essential.</p>
<ul>
<li><strong>Overfitting</strong> refers to a model learning noise or idiosyncrasies in the training data. While overfitted models generalize poorly, they do not necessarily rely on forbidden information.</li>
<li><strong>Data snooping</strong> involves repeated testing and model selection on the same validation set. This inflates performance through selection bias, but the data themselves are not structurally contaminated.</li>
<li><strong>Distribution shift</strong> (or concept drift) occurs when the data-generating process changes over time. This is a real-world phenomenon, not a methodological error.</li>
</ul>
<p>In contrast, <strong>data leakage is a violation of the temporal or logical boundary between training and prediction</strong>. It creates an artificial setting in which the model has access to information it should not logically possess.</p>
</section>
<section id="common-forms-of-data-leakage" class="level3" data-number="2.2">
<h3 data-number="2.2" class="anchored" data-anchor-id="common-forms-of-data-leakage"><span class="header-section-number">2.2</span> Common Forms of Data Leakage</h3>
<p>Data leakage can be broadly categorized into three practical forms:</p>
<ol type="1">
<li><p><strong>Target Leakage</strong><br>
Predictors encode information that is directly derived from, or strongly dependent on, the target variable. For example, constructing a feature using an outcome measured after the event being predicted.</p></li>
<li><p><strong>Train–Test Contamination</strong><br>
Information from the test set influences preprocessing steps such as scaling, imputation, or feature selection. This often happens when transformations are applied to the full dataset <em>before</em> splitting.</p></li>
<li><p><strong>Temporal Leakage</strong><br>
Future observations leak into the past, a particularly common issue in time series and forecasting contexts. Rolling averages, lag structures, or normalization computed using future data fall into this category.</p></li>
</ol>
</section>
<section id="a-simple-conceptual-example" class="level3" data-number="2.3">
<h3 data-number="2.3" class="anchored" data-anchor-id="a-simple-conceptual-example"><span class="header-section-number">2.3</span> A Simple Conceptual Example</h3>
<p>Suppose we aim to predict apartment prices using listing characteristics. If missing values in the price variable are imputed using the <em>global mean price computed over the entire dataset</em>, and the train–test split is performed afterward, then information from the test set has already influenced the training process. The model evaluation is no longer an honest simulation of future performance.</p>
<p>This type of leakage is especially dangerous because it often produces <strong>stable and impressive metrics</strong>, giving practitioners a false sense of security. The model appears reliable not because it has learned a robust relationship, but because the evaluation framework itself is compromised.</p>
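<p>The imputation example above can be sketched in base R. Everything here is hypothetical: the prices are simulated, the 150/50 split and the seed are arbitrary, and mean imputation is used only because it is the simplest case to show.</p>

```r
set.seed(1)                                   # arbitrary seed
# Simulated listing prices with some missing values (hypothetical data)
price <- rlnorm(200, meanlog = 12, sdlog = 0.4)
price[sample(200, 30)] <- NA

test_idx  <- sample(200, 50)                  # hold out 50 rows as the test set
train_idx <- setdiff(seq_len(200), test_idx)

# Leaky: the fill value is computed from all rows, test rows included
leaky_fill <- mean(price, na.rm = TRUE)

# Leakage-free: the fill value is learned from the training rows only
clean_fill <- mean(price[train_idx], na.rm = TRUE)

# Apply the training-based value to missing entries in both train and test
price_clean <- price
price_clean[is.na(price_clean)] <- clean_fill

c(leaky = leaky_fill, clean = clean_fill)
```

<p>The two fill values are usually close, which is part of why this kind of leakage goes unnoticed. The difference is in what the evaluation means: only the second version keeps the test rows out of every quantity the training data depends on.</p>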
<p>In the next section, we look at where leakage typically enters applied workflows. After that, we turn to a real dataset, deliberately construct a seemingly reasonable but flawed preprocessing pipeline, and observe how data leakage shows up in model evaluation.</p>
</section>
</section>
<section id="common-sources-of-data-leakage-in-practice" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="common-sources-of-data-leakage-in-practice"><span class="header-section-number">3</span> Common Sources of Data Leakage in Practice</h2>
<p>Data leakage rarely appears as an obvious error. In practice, it is often the result of <em>reasonable-looking preprocessing decisions</em> applied in the wrong order or under incorrect assumptions. This section outlines the most common sources of leakage encountered in applied statistical modeling and machine learning workflows, with a particular focus on preprocessing stages that precede model fitting.</p>
<section id="leakage-during-data-preprocessing" class="level3" data-number="3.1">
<h3 data-number="3.1" class="anchored" data-anchor-id="leakage-during-data-preprocessing"><span class="header-section-number">3.1</span> Leakage During Data Preprocessing</h3>
<p>One of the most frequent sources of data leakage occurs during data preprocessing. Operations such as centering, scaling, normalization, and missing-value imputation are often applied mechanically to the entire dataset before any data splitting takes place. While this approach may seem harmless, it implicitly allows information from the test set to influence transformations applied to the training data.</p>
<p>For example, consider standardization using the sample mean <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and standard deviation <img src="https://latex.codecogs.com/png.latex?%5Csigma">. If these quantities are computed using the full dataset rather than the training subset alone, then statistics derived from the test data directly affect the transformed training observations. As a result, the model is evaluated in an artificially favorable setting that will never occur in real-world prediction.</p>
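<p>A minimal base R sketch of the standardization case follows. The predictor is simulated and the 80/20 split is arbitrary; the point is only where the mean and standard deviation are estimated.</p>

```r
set.seed(2)                                  # arbitrary seed
x <- rnorm(100, mean = 50, sd = 10)          # hypothetical numeric predictor
train <- x[1:80]
test  <- x[81:100]

# Learn the scaling parameters from the training data only
mu    <- mean(train)
sigma <- sd(train)

train_scaled <- (train - mu) / sigma
test_scaled  <- (test  - mu) / sigma         # reuse training parameters; never re-estimate

# Leaky alternative, shown only for contrast: parameters from the full dataset
leaky_scaled <- as.numeric(scale(x))
```

<p>With this setup the scaled test values will generally not have mean zero and unit variance, and that is expected: at prediction time, new observations are transformed with parameters that were fixed in advance.</p>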
</section>
<section id="leakage-through-feature-engineering" class="level3" data-number="3.2">
<h3 data-number="3.2" class="anchored" data-anchor-id="leakage-through-feature-engineering"><span class="header-section-number">3.2</span> Leakage Through Feature Engineering</h3>
<p>Feature engineering is another common entry point for leakage, particularly when new variables are constructed using aggregated information. Group-level statistics—such as averages, frequencies, or ranks—can easily encode target-related information if computed without respecting the train–test boundary.</p>
<p>A typical example involves creating neighborhood-level average prices in a housing dataset. If these averages are calculated using all available observations, including those later assigned to the test set, the resulting features implicitly incorporate information from unseen data. The model appears to generalize well, but only because future information has already been embedded in the predictors.</p>
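<p>The neighborhood-average example can be sketched as follows. The data frame, column names, and the <code>is_train</code> flag are all hypothetical; the sketch only shows the averages being learned from training rows and then attached to both sets.</p>

```r
# Hypothetical housing data: neighborhood label and price
housing <- data.frame(
  neighborhood = c("A", "A", "B", "B", "A", "B"),
  price        = c(100, 120, 200, 220, 140, 260),
  is_train     = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)
train <- housing[housing$is_train, ]
test  <- housing[!housing$is_train, ]

# Neighborhood averages learned from training rows only
nbhd_means <- aggregate(price ~ neighborhood, data = train, FUN = mean)
names(nbhd_means)[2] <- "nbhd_mean_price"

# Attach the training-based averages to both sets
train <- merge(train, nbhd_means, by = "neighborhood")
test  <- merge(test,  nbhd_means, by = "neighborhood")
test
```

<p>Computing the averages on the full data frame instead would move them toward the held-out prices (140 and 260 here), so the feature would carry part of the test target into the predictors. Note that this sketch addresses only the train–test boundary; within the training set, each row's own price still contributes to its neighborhood average, which is a separate concern.</p>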
</section>
<section id="leakage-from-improper-traintest-splitting" class="level3" data-number="3.3">
<h3 data-number="3.3" class="anchored" data-anchor-id="leakage-from-improper-traintest-splitting"><span class="header-section-number">3.3</span> Leakage from Improper Train–Test Splitting</h3>
<p>In many workflows, data splitting is treated as a purely mechanical step. However, <em>when</em> and <em>how</em> the split is performed matters greatly. Random splits applied after preprocessing steps allow contamination to propagate silently. This issue is exacerbated in small or moderately sized datasets, where even minor information leakage can have a disproportionate effect on evaluation metrics.</p>
<p>The fundamental principle is simple: <strong>any operation that learns from the data must be performed exclusively on the training set</strong>. The learned transformation can then be applied to the test set—but never re-estimated using it.</p>
</section>
<section id="temporal-leakage-in-time-dependent-data" class="level3" data-number="3.4">
<h3 data-number="3.4" class="anchored" data-anchor-id="temporal-leakage-in-time-dependent-data"><span class="header-section-number">3.4</span> Temporal Leakage in Time-Dependent Data</h3>
<p>Time-dependent data introduce an additional and particularly dangerous form of leakage: temporal leakage. This occurs when future observations influence the representation of past data. Common examples include rolling statistics computed using symmetric (centered) windows, global normalization across the full time span, or misaligned lag features that unintentionally incorporate future values.</p>
<p>In forecasting and time series analysis, such leakage violates the chronological ordering of information. The model effectively gains access to future states of the system, leading to performance estimates that are fundamentally invalid. Unlike random contamination, temporal leakage often produces extremely smooth and stable validation results—precisely because the future is partially known.</p>
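<p>The rolling-window case can be made concrete with a short sketch. It assumes a simulated random-walk series and a window of length three, and uses <code>stats::filter()</code> from base R; the variable names are illustrative. Perturbing a single later observation shows which feature "sees" the future.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(7)
y &lt;- cumsum(rnorm(20))   # simulated series (random walk), for illustration only
k &lt;- 3

# Centered window: the value at time t averages y[t-1], y[t], y[t+1]
centered &lt;- as.numeric(stats::filter(y, rep(1 / k, k), sides = 2))

# Trailing window shifted by one step: the value at time t averages
# y[t-3], y[t-2], y[t-1], all of which are known before time t
trailing  &lt;- as.numeric(stats::filter(y, rep(1 / k, k), sides = 1))
past_only &lt;- c(NA, head(trailing, -1))

# Perturb one observation (time 11) and recompute both features
y_shock &lt;- y
y_shock[11] &lt;- y_shock[11] + 100
centered_shock  &lt;- as.numeric(stats::filter(y_shock, rep(1 / k, k), sides = 2))
trailing_shock  &lt;- as.numeric(stats::filter(y_shock, rep(1 / k, k), sides = 1))
past_only_shock &lt;- c(NA, head(trailing_shock, -1))

# At time 10 the centered feature reacts to the shock at time 11;
# the past-only feature is unchanged
c(centered  = centered_shock[10]  - centered[10],
  past_only = past_only_shock[10] - past_only[10])</code></pre></div>
</div>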
</section>
<section id="why-these-issues-are-hard-to-detect" class="level3" data-number="3.5">
<h3 data-number="3.5" class="anchored" data-anchor-id="why-these-issues-are-hard-to-detect"><span class="header-section-number">3.5</span> Why These Issues Are Hard to Detect</h3>
<p>What makes data leakage especially problematic is not its complexity, but its subtlety. Leakage-prone pipelines often run without errors, produce clean outputs, and yield impressive metrics. In many cases, the only warning sign is performance that seems <em>too consistent</em> or <em>too good to be true</em>.</p>
<p>Crucially, standard validation techniques cannot detect leakage that was introduced before the data were split. Once contamination has occurred, every resample inherits the same contaminated features, so even rigorous cross-validation merely reinforces a flawed evaluation framework.</p>
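<p>A classic way to see this, separate from the pipeline studied in this article, is to cross-validate a model on pure noise after a data-dependent step has already used every row. In the sketch below the target is unrelated to all predictors by construction, so any apparent skill is an artifact. The helpers <code>top_k()</code> and <code>cv_rmse()</code>, the sample sizes, and the choice of correlation-based screening are assumptions made for the illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(99)
n &lt;- 60; p &lt;- 500
X &lt;- matrix(rnorm(n * p), nrow = n)   # pure-noise predictors
y &lt;- rnorm(n)                         # target unrelated to every predictor
folds &lt;- sample(rep(1:5, length.out = n))

# Hypothetical helper: indices of the k predictors most correlated with y
top_k &lt;- function(X, y, k = 10) order(abs(cor(X, y)), decreasing = TRUE)[1:k]

# Hypothetical helper: 5-fold cross-validated RMSE of a linear model;
# `select_inside` controls whether screening happens within each fold
cv_rmse &lt;- function(select_inside) {
  keep_all &lt;- top_k(X, y)             # screening on every row (leaky)
  errs &lt;- numeric(0)
  for (f in 1:5) {
    tr   &lt;- folds != f
    keep &lt;- if (select_inside) top_k(X[tr, ], y[tr]) else keep_all
    train_df &lt;- data.frame(y = y[tr], X[tr, keep])
    test_df  &lt;- data.frame(X[!tr, keep])
    fit  &lt;- lm(y ~ ., data = train_df)
    errs &lt;- c(errs, y[!tr] - predict(fit, newdata = test_df))
  }
  sqrt(mean(errs^2))
}

res &lt;- c(leaky = cv_rmse(FALSE), honest = cv_rmse(TRUE), sd_y = sd(y))
res</code></pre></div>
</div>
<p>One would typically expect the leaky estimate to fall below the honest one, even though both use the same folds and the same model. The cross-validation loop itself is identical; only the position of the screening step relative to the split differs.</p>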
<p>In the next section, we will make these ideas concrete by constructing a deliberately flawed preprocessing pipeline using a real dataset. By examining the resulting performance metrics and visual diagnostics, we will observe how data leakage manifests itself in practice.</p>
</section>
</section>
<section id="dataset-description-airbnb-listings-data" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="dataset-description-airbnb-listings-data"><span class="header-section-number">4</span> Dataset Description: Airbnb Listings Data</h2>
<p>To demonstrate how data leakage arises in practice, we use a real-world dataset derived from Airbnb listings. The dataset is obtained from the publicly available <em>Inside Airbnb</em> project, which provides detailed, regularly updated information on short-term rental listings for major cities worldwide. In this study, we focus on the Istanbul listings, which offer a rich combination of numerical and categorical variables and exhibit common data quality issues encountered in applied modeling tasks.</p>
<p>The Inside Airbnb project aims to support research, policy analysis, and public discussion by making scraped Airbnb data openly accessible. The dataset includes listing-level attributes such as pricing information, accommodation characteristics, host-related variables, and geographic identifiers. Due to its size, heterogeneity, and real-world imperfections, it provides an ideal setting for illustrating preprocessing pitfalls and evaluation errors.</p>
<section id="data-source" class="level3" data-number="4.1">
<h3 data-number="4.1" class="anchored" data-anchor-id="data-source"><span class="header-section-number">4.1</span> Data Source</h3>
<p>The data are publicly available at:</p>
<blockquote class="blockquote">
<p><a href="https://insideairbnb.com/get-the-data/" class="uri">https://insideairbnb.com/get-the-data/</a></p>
</blockquote>
<p>For reproducibility, the analysis in this article uses a snapshot of the Istanbul listings dataset downloaded directly from the source. While the exact number of observations may vary across releases, the structure and modeling challenges remain consistent across versions.</p>
</section>
<section id="target-variable-and-modeling-objective" class="level3" data-number="4.2">
<h3 data-number="4.2" class="anchored" data-anchor-id="target-variable-and-modeling-objective"><span class="header-section-number">4.2</span> Target Variable and Modeling Objective</h3>
<p>Our primary modeling objective is to predict the <strong>listing price</strong> based on observable characteristics of the property and its location. The target variable, denoted by <img src="https://latex.codecogs.com/png.latex?y">, corresponds to the nightly price of a listing in local currency units.</p>
<p>Price prediction in short-term rental data is a well-studied problem and serves as a natural example for illustrating data leakage. Importantly, price exhibits:</p>
<ul>
<li>strong right skewness,</li>
<li>substantial heterogeneity across neighborhoods,</li>
<li>sensitivity to aggregation and preprocessing choices.</li>
</ul>
<p>These properties make the variable particularly vulnerable to leakage through global transformations and improperly constructed features.</p>
</section>
<section id="predictor-variables" class="level3" data-number="4.3">
<h3 data-number="4.3" class="anchored" data-anchor-id="predictor-variables"><span class="header-section-number">4.3</span> Predictor Variables</h3>
<p>The predictor set includes a mix of numerical and categorical variables commonly used in pricing models, such as:</p>
<ul>
<li>accommodation capacity (e.g., number of guests),</li>
<li>room type and property type,</li>
<li>neighborhood identifiers,</li>
<li>availability-related measures,</li>
<li>host characteristics.</li>
</ul>
<p>Several variables contain missing values, and many exhibit heavy-tailed distributions. These features necessitate preprocessing steps such as imputation, scaling, and transformation—precisely the stages where data leakage most often occurs.</p>
</section>
<section id="why-this-dataset-is-suitable-for-studying-data-leakage" class="level3" data-number="4.4">
<h3 data-number="4.4" class="anchored" data-anchor-id="why-this-dataset-is-suitable-for-studying-data-leakage"><span class="header-section-number">4.4</span> Why This Dataset Is Suitable for Studying Data Leakage</h3>
<p>This dataset is especially well-suited for examining data leakage for three reasons. First, it requires nontrivial preprocessing to be usable for modeling, increasing the risk of incorrect transformation order. Second, it includes categorical groupings (such as neighborhoods) that invite aggregation-based feature engineering, a common source of target leakage. Third, its real-world origin ensures that modeling assumptions—such as stationarity, completeness, and clean measurement—are only approximately satisfied.</p>
<p>By working with this dataset, we intentionally place ourselves in a realistic applied setting, where leakage is not an abstract concept but a tangible risk. In the next section, we construct a seemingly reasonable preprocessing pipeline that violates key evaluation principles, allowing us to observe how data leakage inflates model performance in practice.</p>
</section>
</section>
<section id="a-naive-preprocessing-pipeline-and-why-it-is-wrong" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="a-naive-preprocessing-pipeline-and-why-it-is-wrong"><span class="header-section-number">5</span> A Naive Preprocessing Pipeline (And Why It Is Wrong)</h2>
<p>At first glance, many preprocessing pipelines appear perfectly reasonable. Data are cleaned, missing values are handled, variables are scaled, and only then is the dataset split into training and test sets. This workflow is intuitive, easy to implement, and—most importantly—widely used. Unfortunately, it is also fundamentally flawed.</p>
<p>In this section, we deliberately construct such a <em>naive pipeline</em> to illustrate how data leakage can arise without any obvious warning signs.</p>
<section id="step-1-loading-and-preparing-the-data" class="level3" data-number="5.1">
<h3 data-number="5.1" class="anchored" data-anchor-id="step-1-loading-and-preparing-the-data"><span class="header-section-number">5.1</span> Step 1: Loading and Preparing the Data</h3>
<p>We begin by loading the Airbnb listings data and selecting a subset of variables commonly used for price prediction. For simplicity, we focus on numerical predictors, which require no categorical encoding; the only cleaning needed at this stage is converting the price field from text to a number.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(rsample)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load data (example assumes listings.csv from Inside Airbnb)</span></span>
<span id="cb1-5">airbnb <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"listings.csv"</span>)</span>
<span id="cb1-6"></span>
<span id="cb1-7">airbnb_model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> airbnb <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(</span>
<span id="cb1-9">    price,</span>
<span id="cb1-10">    accommodates,</span>
<span id="cb1-11">    bedrooms,</span>
<span id="cb1-12">    bathrooms,</span>
<span id="cb1-13">    minimum_nights,</span>
<span id="cb1-14">    availability_365</span>
<span id="cb1-15">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb1-17">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">price =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_remove_all</span>(price, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[$,]"</span>))</span>
<span id="cb1-18">  )</span></code></pre></div></div>
</div>
<p>At this stage, the dataset already contains missing values and variables with highly skewed distributions—a realistic and unavoidable situation in applied work.</p>
</section>
<section id="step-2-global-preprocessing-the-critical-mistake" class="level3" data-number="5.2">
<h3 data-number="5.2" class="anchored" data-anchor-id="step-2-global-preprocessing-the-critical-mistake"><span class="header-section-number">5.2</span> Step 2: Global Preprocessing (The Critical Mistake)</h3>
<p>A common approach is to perform preprocessing <em>once</em> on the full dataset. Below, we impute missing values using the global mean and standardize all predictors using statistics computed from the entire dataset.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">airbnb_preprocessed <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> airbnb_model <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(</span>
<span id="cb2-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.cols =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>price,</span>
<span id="cb2-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.fns  =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(.x), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(.x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>), .x)</span>
<span id="cb2-5">  )) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(</span>
<span id="cb2-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.cols =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>price,</span>
<span id="cb2-8">    <span class="at" style="color: #657422;
background-color: null;
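font-style: inherit;">.fns  =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale</span>(.x))</span>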
<span id="cb2-9">  ))</span></code></pre></div></div>
</div>
<p>From a purely technical perspective, this code runs without errors and produces clean, well-behaved predictors. However, the preprocessing steps above implicitly use information from <em>all observations</em>, including those that will later be assigned to the test set.</p>
<p>At this point, data leakage has already occurred.</p>
</section>
<section id="step-3-traintest-split-after-preprocessing" class="level3" data-number="5.3">
<h3 data-number="5.3" class="anchored" data-anchor-id="step-3-traintest-split-after-preprocessing"><span class="header-section-number">5.3</span> Step 3: Train–Test Split After Preprocessing</h3>
<p>Next, we perform a random split of the preprocessed data into training and test sets.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb3-2"></span>
<span id="cb3-3">split <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">initial_split</span>(airbnb_preprocessed, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb3-4">train_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">training</span>(split)</span>
<span id="cb3-5">test_data  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">testing</span>(split)</span></code></pre></div></div>
</div>
<p>Because the split is applied <em>after</em> preprocessing, the training data have been standardized and imputed using statistics influenced by the test data. The train–test boundary, while present in code, has already been violated in substance.</p>
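<p>A small simulation, separate from the Airbnb data, makes this violation visible. It assumes a single skewed predictor that is standardized globally and then split 80/20; the variable names are illustrative. Because the full-sample mean is forced to zero, the training mean and the test mean are tied together by an exact identity.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(123)
x &lt;- rlnorm(500, meanlog = 0, sdlog = 1)   # skewed simulated predictor
z &lt;- as.numeric(scale(x))                  # global standardization (the naive step)

train_idx &lt;- sample(length(z), size = 0.8 * length(z))
z_train &lt;- z[train_idx]
z_test  &lt;- z[-train_idx]

# With train-only statistics the training mean would be 0.
# Here it is not, and it mirrors the test mean with the opposite sign:
# n_train * mean(z_train) + n_test * mean(z_test) = 0
c(train_mean = mean(z_train),
  test_mean  = mean(z_test),
  check      = length(z_train) * mean(z_train) + length(z_test) * mean(z_test))</code></pre></div>
</div>
<p>The <code>check</code> entry is zero up to floating-point error, which means the location of the training features is fully determined by the test rows that were supposed to remain unseen.</p>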
</section>
<section id="step-4-model-fitting-and-evaluation" class="level3" data-number="5.4">
<h3 data-number="5.4" class="anchored" data-anchor-id="step-4-model-fitting-and-evaluation"><span class="header-section-number">5.4</span> Step 4: Model Fitting and Evaluation</h3>
<p>We now fit a simple linear regression model using the training data and evaluate its predictive performance on the test set. At this stage, the goal is not to build an optimal model, but to assess how the <em>evaluation framework itself</em> can be compromised by data leakage.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit a linear regression model on the training data</span></span>
<span id="cb4-2">model_naive <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train_data)</span>
<span id="cb4-3"></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate predictions for the test set</span></span>
<span id="cb4-5">pred_test <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(model_naive, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> test_data)</span></code></pre></div></div>
</div>
<p>To compute a supervised performance metric, we must restrict the evaluation to test observations for which the target variable is observed. Listings with missing prices cannot contribute to an error metric such as RMSE, as no ground truth is available.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create an evaluation dataset with observed targets only</span></span>
<span id="cb5-2">eval_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> test_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb5-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">transmute</span>(</span>
<span id="cb5-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">price =</span> price,</span>
<span id="cb5-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pred  =</span> pred_test</span>
<span id="cb5-6">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb5-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(price), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(pred))</span>
<span id="cb5-8"></span>
<span id="cb5-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Root Mean Squared Error</span></span>
<span id="cb5-10">rmse_naive <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>((eval_df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> eval_df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pred)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb5-11">rmse_naive</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 21213.69</code></pre>
</div>
</div>
<p>The computed RMSE provides a single-point estimate of out-of-sample error under this evaluation setup. However, the absolute magnitude of this value is difficult to interpret in isolation because it depends on the scale and distribution of the target variable (price). More importantly for this article, the key concern is methodological: preprocessing steps were estimated using the full dataset before splitting, which compromises the train–test separation and can lead to overly optimistic performance estimates.</p>
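<p>One possible way to give a raw RMSE some context is to compare it with the error of a trivial reference model that always predicts the training mean. The sketch below does this on simulated data; the data frame <code>sim</code>, the <code>rmse()</code> helper, and the data-generating process are assumptions for illustration and say nothing about the Airbnb results.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(2024)
n &lt;- 300

# Simulated listings with a right-skewed price (illustration only)
sim &lt;- data.frame(accommodates = sample(1:8, n, replace = TRUE))
sim$price &lt;- 500 * sim$accommodates * rlnorm(n, sdlog = 0.6)

train_idx &lt;- sample(n, size = 0.8 * n)
train &lt;- sim[train_idx, ]
test  &lt;- sim[-train_idx, ]

# Hypothetical helper: root mean squared error
rmse &lt;- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit &lt;- lm(price ~ accommodates, data = train)
rmse_model    &lt;- rmse(test$price, predict(fit, newdata = test))
rmse_baseline &lt;- rmse(test$price, mean(train$price))   # intercept-only reference

# A ratio below 1 means the model improves on the trivial reference
c(model = rmse_model, baseline = rmse_baseline, ratio = rmse_model / rmse_baseline)</code></pre></div>
</div>
<p>A relative measure of this kind is easier to read than the absolute value, but it does not repair a contaminated evaluation: if the preprocessing leaked, both numbers are computed on the same compromised test set.</p>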
<p>In the next section, we will evaluate this suspicion more systematically by repeating the procedure across multiple random splits and inspecting the distribution of performance metrics.</p>
</section>
</section>
<section id="detecting-data-leakage-repeated-splits-and-performance-distributions" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="detecting-data-leakage-repeated-splits-and-performance-distributions"><span class="header-section-number">6</span> Detecting Data Leakage: Repeated Splits and Performance Distributions</h2>
<p>A single train–test split provides only a point estimate of model performance. To assess whether the result observed earlier is an artifact of one particular split or reflects a structural issue, we repeat the naive preprocessing and evaluation procedure across multiple random splits of the data. This allows us to examine the <em>distribution</em> of performance metrics rather than relying on a single value.</p>
<section id="repeated-evaluation-under-the-naive-pipeline" class="level3" data-number="6.1">
<h3 data-number="6.1" class="anchored" data-anchor-id="repeated-evaluation-under-the-naive-pipeline"><span class="header-section-number">6.1</span> Repeated Evaluation Under the Naive Pipeline</h3>
<p>We repeat the following steps multiple times:</p>
<ol type="1">
<li>Randomly split the data into training and test sets.</li>
<li>Fit the model on the training data.</li>
<li>Compute RMSE on the test data using observed targets only.</li>
</ol>
<p>Crucially, <strong>the same flawed preprocessing pipeline is retained</strong>, meaning that scaling and imputation are still performed on the full dataset prior to splitting.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb7-2"></span>
<span id="cb7-3">n_repeats <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span></span>
<span id="cb7-4">rmse_values <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">numeric</span>(n_repeats)</span>
<span id="cb7-5"></span>
<span id="cb7-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq_len</span>(n_repeats)) {</span>
<span id="cb7-7">  </span>
<span id="cb7-8">  split_i <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">initial_split</span>(airbnb_preprocessed, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb7-9">  train_i <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">training</span>(split_i)</span>
<span id="cb7-10">  test_i  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">testing</span>(split_i)</span>
<span id="cb7-11">  </span>
<span id="cb7-12">  model_i <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train_i)</span>
<span id="cb7-13">  pred_i  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(model_i, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> test_i)</span>
<span id="cb7-14">  </span>
<span id="cb7-15">  eval_i <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(</span>
<span id="cb7-16">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">price =</span> test_i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price,</span>
<span id="cb7-17">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pred  =</span> pred_i</span>
<span id="cb7-18">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb7-19">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(price), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(pred))</span>
<span id="cb7-20">  </span>
<span id="cb7-21">  rmse_values[i] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>((eval_i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> eval_i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pred)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb7-22">}</span>
<span id="cb7-23"></span>
<span id="cb7-24">rmse_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(</span>
<span id="cb7-25">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">iteration =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq_len</span>(n_repeats),</span>
<span id="cb7-26">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rmse      =</span> rmse_values</span>
<span id="cb7-27">)</span></code></pre></div></div>
</div>
</section>
<section id="inspecting-the-rmse-distribution" class="level3" data-number="6.2">
<h3 data-number="6.2" class="anchored" data-anchor-id="inspecting-the-rmse-distribution"><span class="header-section-number">6.2</span> Inspecting the RMSE Distribution</h3>
<p>Rather than focusing on individual values, we now inspect the distribution of RMSE across repeated splits.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb8-2"></span>
<span id="cb8-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(rmse_df, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> rmse)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bins =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#4C72B0"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_vline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xintercept =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(rmse_df<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>rmse), </span>
<span id="cb8-6">             <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>, </span>
<span id="cb8-7">             <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb8-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RMSE Distribution Under Naive Preprocessing"</span>,</span>
<span id="cb8-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Repeated random train–test splits"</span>,</span>
<span id="cb8-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RMSE"</span>,</span>
<span id="cb8-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Count"</span></span>
<span id="cb8-13">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-01-22_data_leakage/index_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="interpretation" class="level3" data-number="6.3">
<h3 data-number="6.3" class="anchored" data-anchor-id="interpretation"><span class="header-section-number">6.3</span> Interpretation</h3>
<p>The RMSE values obtained across repeated random splits vary substantially: they span a wide range rather than clustering in a narrow interval. This dispersion reflects the heterogeneity of the data and the sensitivity of the model to the particular training–test partition.</p>
<p>Importantly, this result highlights a key limitation of relying on a single train–test split: performance estimates can vary dramatically depending on how the data are partitioned. At this stage, the variability itself does not constitute evidence of data leakage. Instead, it establishes a baseline level of uncertainty against which alternative preprocessing strategies must be evaluated.</p>
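<p>The spread described above can also be put into numbers. The following is a minimal sketch, not part of the original analysis: <code>rmse_toy</code> is a simulated, hypothetical stand-in for the RMSE values collected across splits, and the choice of summary statistics is an assumption.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(1)

# Hypothetical stand-in for the RMSE values collected across repeated splits
rmse_toy &lt;- rlnorm(30, meanlog = log(100), sdlog = 0.25)

# Location, spread, and range of the RMSE distribution
rmse_summary &lt;- c(
  mean = mean(rmse_toy),
  sd   = sd(rmse_toy),
  min  = min(rmse_toy),
  max  = max(rmse_toy),
  cv   = sd(rmse_toy) / mean(rmse_toy)  # spread relative to the mean
)
round(rmse_summary, 2)

# Empirical 5%-95% range of the RMSE values
quantile(rmse_toy, probs = c(0.05, 0.95))</code></pre></div></div>
</div>
<p>With the objects from this post, <code>rmse_toy</code> would be replaced by <code>rmse_df$rmse</code>.</p>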
<p>In the following section, we will repeat the same experiment using a leakage-free preprocessing pipeline. By comparing the resulting RMSE distributions, we can assess whether improper preprocessing leads to systematically optimistic or distorted performance estimates.</p>
</section>
</section>
<section id="a-leakage-free-preprocessing-pipeline" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="a-leakage-free-preprocessing-pipeline"><span class="header-section-number">7</span> A Leakage-Free Preprocessing Pipeline</h2>
<p>To assess whether the previously observed behavior is driven by improper preprocessing, we now reconstruct the entire workflow using a leakage-free pipeline. The key principle is simple but fundamental: <strong>any transformation that learns from the data must be estimated using the training set only and then applied to the test set without re-estimation</strong>.</p>
<section id="correct-order-of-operations" class="level3" data-number="7.1">
<h3 data-number="7.1" class="anchored" data-anchor-id="correct-order-of-operations"><span class="header-section-number">7.1</span> Correct Order of Operations</h3>
<p>The leakage-free workflow follows this sequence:</p>
<ol type="1">
<li>Split the data into training and test sets.</li>
<li>Estimate preprocessing parameters using the training data only.</li>
<li>Apply the learned transformations to both training and test sets.</li>
<li>Fit the model on the transformed training data.</li>
<li>Evaluate performance on the transformed test data.</li>
</ol>
<p>This ordering mirrors real-world deployment, where future observations arrive without access to global dataset statistics.</p>
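<p>For a single split, the five steps above can be sketched with base R alone. This is an illustrative sketch under stated assumptions: the data are simulated and complete (no imputation step), and the column names <code>reviews</code> and <code>nights</code> are hypothetical. It relies on the fact that <code>scale()</code> stores the centering and scaling values it used as the attributes <code>"scaled:center"</code> and <code>"scaled:scale"</code>, so they can be reused on new data.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(123)

# Simulated, complete stand-in data (hypothetical columns)
n &lt;- 300
toy &lt;- data.frame(
  reviews = rnorm(n, mean = 30, sd = 12),
  nights  = rnorm(n, mean = 4, sd = 2)
)
toy$price &lt;- 50 + 2 * toy$reviews + 10 * toy$nights + rnorm(n, sd = 20)
predictors &lt;- c("reviews", "nights")

# 1. Split the data
train_idx &lt;- sample(n, size = floor(0.8 * n))
train &lt;- toy[train_idx, ]
test  &lt;- toy[-train_idx, ]

# 2. Estimate centering and scaling parameters on the training set only
train_x &lt;- scale(train[predictors])
centers &lt;- attr(train_x, "scaled:center")
scales  &lt;- attr(train_x, "scaled:scale")

# 3. Apply the stored parameters to the test set (no re-estimation)
test_x &lt;- scale(test[predictors], center = centers, scale = scales)

train_ready &lt;- data.frame(price = train$price, train_x)
test_ready  &lt;- data.frame(price = test$price, test_x)

# 4. Fit the model on the transformed training data
fit &lt;- lm(price ~ ., data = train_ready)

# 5. Evaluate on the transformed test data
pred &lt;- predict(fit, newdata = test_ready)
rmse &lt;- sqrt(mean((test_ready$price - pred)^2))
rmse</code></pre></div></div>
</div>
<p>The training columns end up with mean zero and unit standard deviation; the test columns generally do not, which is expected because their parameters come from the training set.</p>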
</section>
<section id="implementing-leakage-free-preprocessing-in-r" class="level3" data-number="7.2">
<h3 data-number="7.2" class="anchored" data-anchor-id="implementing-leakage-free-preprocessing-in-r"><span class="header-section-number">7.2</span> Implementing Leakage-Free Preprocessing in R</h3>
<p>We begin by repeating the evaluation procedure across multiple random splits, as in the previous section. This time, however, preprocessing steps are learned exclusively from the training data.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb9-2"></span>
<span id="cb9-3">n_repeats <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span></span>
<span id="cb9-4">rmse_correct <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">numeric</span>(n_repeats)</span>
<span id="cb9-5"></span>
<span id="cb9-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq_len</span>(n_repeats)) {</span>
<span id="cb9-7">  </span>
<span id="cb9-8">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Split first</span></span>
<span id="cb9-9">  split_i <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">initial_split</span>(airbnb_model, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb9-10">  train_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">training</span>(split_i)</span>
<span id="cb9-11">  test_raw  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">testing</span>(split_i)</span>
<span id="cb9-12">  </span>
<span id="cb9-13">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Estimate preprocessing parameters on training data only</span></span>
<span id="cb9-14">  train_processed <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> train_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-15">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(</span>
<span id="cb9-16">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.cols =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>price,</span>
<span id="cb9-17">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.fns  =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(.x), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(.x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>), .x)</span>
<span id="cb9-18">    ))</span>
<span id="cb9-19">  </span>
<span id="cb9-20">  scaling_params <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> train_processed <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
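<span id="cb9-21">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>price, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(.x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(.x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>))))</span>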
<span id="cb9-22">  </span>
<span id="cb9-23">  scale_train <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x, m, s) {</span>
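<span id="cb9-24">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">isTRUE</span>(s <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)) (x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> m) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> s <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(x))  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># s is a scalar: ifelse() would return only one value</span></span>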
<span id="cb9-25">  }</span>
<span id="cb9-26">  </span>
<span id="cb9-27">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (v <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(train_processed)[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(train_processed) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"price"</span>]) {</span>
<span id="cb9-28">    m <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> scaling_params[[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(v, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_mean"</span>)]]</span>
<span id="cb9-29">    s <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> scaling_params[[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(v, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_sd"</span>)]]</span>
<span id="cb9-30">    </span>
<span id="cb9-31">    train_processed[[v]] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_train</span>(train_processed[[v]], m, s)</span>
<span id="cb9-32">  }</span>
<span id="cb9-33">  </span>
<span id="cb9-34">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Apply the same transformations to the test set</span></span>
<span id="cb9-35">  test_processed <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> test_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-36">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(</span>
<span id="cb9-37">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.cols =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>price,</span>
<span id="cb9-38">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.fns  =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(.x), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(train_raw[[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cur_column</span>()]], <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>), .x)</span>
<span id="cb9-39">    ))</span>
<span id="cb9-40">  </span>
<span id="cb9-41">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (v <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(test_processed)[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(test_processed) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"price"</span>]) {</span>
<span id="cb9-42">    m <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> scaling_params[[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(v, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_mean"</span>)]]</span>
<span id="cb9-43">    s <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> scaling_params[[<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste0</span>(v, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_sd"</span>)]]</span>
<span id="cb9-44">    </span>
<span id="cb9-45">    test_processed[[v]] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_train</span>(test_processed[[v]], m, s)</span>
<span id="cb9-46">  }</span>
<span id="cb9-47">  </span>
<span id="cb9-48">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit model</span></span>
<span id="cb9-49">  model_i <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train_processed)</span>
<span id="cb9-50">  pred_i  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(model_i, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> test_processed)</span>
<span id="cb9-51">  </span>
<span id="cb9-52">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Evaluate where target is observed</span></span>
<span id="cb9-53">  eval_i <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(</span>
<span id="cb9-54">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">price =</span> test_processed<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price,</span>
<span id="cb9-55">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pred  =</span> pred_i</span>
<span id="cb9-56">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-57">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(price), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(pred))</span>
<span id="cb9-58">  </span>
<span id="cb9-59">  rmse_correct[i] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>((eval_i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> eval_i<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pred)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb9-60">}</span>
<span id="cb9-61"></span>
<span id="cb9-62">rmse_correct_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(</span>
<span id="cb9-63">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">iteration =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq_len</span>(n_repeats),</span>
<span id="cb9-64">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rmse      =</span> rmse_correct</span>
<span id="cb9-65">)</span></code></pre></div></div>
</div>
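<p>The same train-only logic can also be expressed with the <code>recipes</code> package from tidymodels, where <code>prep()</code> estimates the parameters and <code>bake()</code> applies them. This is a hedged sketch of an alternative, not the approach used in this post: the data are simulated, the column names are hypothetical, and it assumes a reasonably recent <code>recipes</code> version that provides <code>step_impute_mean()</code> and <code>all_numeric_predictors()</code>.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(recipes)

set.seed(123)

# Simulated stand-in data (hypothetical columns, not the Airbnb data)
n &lt;- 300
toy &lt;- data.frame(
  price   = rlnorm(n, meanlog = 4, sdlog = 0.6),
  reviews = rnorm(n, mean = 30, sd = 12),
  nights  = rnorm(n, mean = 4, sd = 2)
)
toy$reviews[sample(n, 20)] &lt;- NA  # inject some missing values

train_idx &lt;- sample(n, size = floor(0.8 * n))
train &lt;- toy[train_idx, ]
test  &lt;- toy[-train_idx, ]

# Declare the preprocessing steps; nothing is estimated yet
rec &lt;- recipe(price ~ ., data = train)
rec &lt;- step_impute_mean(rec, all_numeric_predictors())
rec &lt;- step_normalize(rec, all_numeric_predictors())

# prep() estimates imputation means, centers, and scales from `train` only
rec_prepped &lt;- prep(rec, training = train)

# bake() applies the stored parameters without re-estimating them
train_baked &lt;- bake(rec_prepped, new_data = train)
test_baked  &lt;- bake(rec_prepped, new_data = test)

fit  &lt;- lm(price ~ ., data = train_baked)
pred &lt;- predict(fit, newdata = test_baked)
sqrt(mean((test_baked$price - pred)^2))</code></pre></div></div>
</div>
<p>Because the prepped recipe carries its own parameters, the test set cannot influence them by accident, which is the main design argument for this style.</p>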
</section>
<section id="comparing-performance-distributions" class="level3" data-number="7.3">
<h3 data-number="7.3" class="anchored" data-anchor-id="comparing-performance-distributions"><span class="header-section-number">7.3</span> Comparing Performance Distributions</h3>
<p>We now compare RMSE distributions obtained under the naive and leakage-free preprocessing pipelines.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1">rmse_compare <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(</span>
<span id="cb10-2">  rmse_df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pipeline =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Naive preprocessing"</span>),</span>
<span id="cb10-3">  rmse_correct_df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pipeline =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Leakage-free preprocessing"</span>)</span>
<span id="cb10-4">)</span>
<span id="cb10-5"></span>
<span id="cb10-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(rmse_compare, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> rmse, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> pipeline)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"identity"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bins =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb10-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RMSE Distributions Under Different Preprocessing Pipelines"</span>,</span>
<span id="cb10-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Naive vs leakage-free evaluation"</span>,</span>
<span id="cb10-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RMSE"</span>,</span>
<span id="cb10-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Count"</span></span>
<span id="cb10-13">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-01-22_data_leakage/index_files/figure-html/unnamed-chunk-9-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="interpretation-1" class="level3" data-number="7.4">
<h3 data-number="7.4" class="anchored" data-anchor-id="interpretation-1"><span class="header-section-number">7.4</span> Interpretation</h3>
<p>The RMSE distributions obtained under the naive and leakage-free preprocessing pipelines are nearly indistinguishable. Across repeated random splits, both approaches yield similar ranges, central tendencies, and tail behavior. Visually, the two histograms largely overlap, causing the leakage-free distribution to be obscured in the combined plot; this overlap itself reflects the near-identical numerical behavior of the two pipelines under the present modeling setup.</p>
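<p>When two histograms overlap this closely, a per-pipeline summary table and a faceted plot can make the comparison easier to read. The sketch below is illustrative only: <code>rmse_compare_toy</code> holds simulated, hypothetical RMSE values rather than the results of this post, and the choice of summaries is an assumption.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(ggplot2)

set.seed(1)

# Hypothetical RMSE values standing in for the two pipelines
rmse_compare_toy &lt;- data.frame(
  rmse     = c(rnorm(30, mean = 100, sd = 15), rnorm(30, mean = 101, sd = 15)),
  pipeline = rep(c("Naive preprocessing", "Leakage-free preprocessing"), each = 30)
)

# Numeric summary per pipeline
summary_by_pipeline &lt;- aggregate(
  rmse ~ pipeline,
  data = rmse_compare_toy,
  FUN  = function(x) c(mean = mean(x), sd = sd(x), median = median(x))
)
summary_by_pipeline

# One panel per pipeline, so neither histogram hides the other
p &lt;- ggplot(rmse_compare_toy, aes(x = rmse, fill = pipeline)) +
  geom_histogram(bins = 15, color = "white", show.legend = FALSE) +
  facet_wrap(~ pipeline, ncol = 1) +
  labs(x = "RMSE", y = "Count") +
  theme_minimal(base_size = 12)
p</code></pre></div></div>
</div>
<p>With the objects from this post, <code>rmse_compare</code> would take the place of <code>rmse_compare_toy</code>.</p>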
<p>This result demonstrates an important but often overlooked point: data leakage does not always lead to dramatic or easily detectable performance inflation. In some settings—particularly with simple models and highly variable targets—the numerical impact of leakage may be minimal, even though the evaluation procedure remains theoretically flawed.</p>
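<p>One contributing reason may be specific to ordinary least squares. Standardizing a predictor is an affine transformation, and a linear model with an intercept produces the same fitted values under any such reparameterization. For <code>lm()</code>, the choice between global and training-only means and standard deviations therefore cannot change the predictions; any difference between the two pipelines would have to come from other steps, such as imputation. The sketch below illustrates this on simulated, complete data with hypothetical column names. It does not reproduce the analysis in this post.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(2026)

# Simulated stand-in data (hypothetical columns, complete cases only)
n &lt;- 200
toy &lt;- data.frame(
  x1 = rnorm(n, mean = 50, sd = 10),
  x2 = rexp(n, rate = 0.1)
)
toy$price &lt;- 20 + 1.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n, sd = 5)

train_idx &lt;- sample(n, size = 0.8 * n)
train &lt;- toy[train_idx, ]
test  &lt;- toy[-train_idx, ]

# Standardize the predictors of `df` with means and sds taken from `ref`
standardise &lt;- function(df, ref, cols = c("x1", "x2")) {
  for (v in cols) {
    df[[v]] &lt;- (df[[v]] - mean(ref[[v]])) / sd(ref[[v]])
  }
  df
}

# Naive: scaling parameters from the full data (train + test)
fit_naive  &lt;- lm(price ~ ., data = standardise(train, ref = toy))
pred_naive &lt;- predict(fit_naive, newdata = standardise(test, ref = toy))

# Leakage-free: scaling parameters from the training rows only
fit_clean  &lt;- lm(price ~ ., data = standardise(train, ref = train))
pred_clean &lt;- predict(fit_clean, newdata = standardise(test, ref = train))

# Largest absolute difference between the two sets of predictions
max(abs(pred_naive - pred_clean))</code></pre></div></div>
</div>
<p>The two sets of predictions differ only at floating-point precision. This invariance does not carry over to distance-based or penalized models, where the scaling parameters directly affect the fit.</p>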
<p>Crucially, the absence of a visible performance gap does not validate the naive pipeline. Instead, it highlights the need to assess preprocessing decisions based on methodological correctness rather than empirical convenience. In other contexts, datasets, or modeling frameworks, the same mistake could lead to substantial and misleading performance gains.</p>
</section>
</section>
<section id="conclusion" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="conclusion"><span class="header-section-number">8</span> Conclusion</h2>
<p>This article set out with a seemingly straightforward question: can data leakage lead to misleadingly strong model performance? The empirical results presented here suggest a more nuanced answer. In the examined setting—using a simple linear model and a highly heterogeneous real-world dataset—improper preprocessing did not result in dramatic or easily detectable performance inflation. Naive and leakage-free pipelines produced nearly identical error distributions.</p>
<p>However, this outcome does not diminish the importance of data leakage. On the contrary, it highlights its most insidious characteristic: <strong>data leakage is dangerous precisely because it does not always announce itself through obvious performance gains</strong>. Evaluation metrics can look stable and entirely plausible even though the underlying logic of the evaluation has already been violated.</p>
<p>The central lesson is therefore not about performance optimization, but about validity. Correct model evaluation is a matter of respecting information boundaries—temporal, logical, and structural—regardless of whether immediate numerical consequences are visible. Relying on empirically convenient shortcuts simply because they “seem to work” risks building pipelines that fail silently when transferred to new data, different models, or operational settings.</p>
<p>Ultimately, data leakage should be treated as a methodological error, not a performance issue. Thinking carefully about preprocessing order, information flow, and evaluation design is not optional; it is a prerequisite for trustworthy statistical modeling.</p>
</section>
<section id="references" class="level2" data-number="9">
<h2 data-number="9" class="anchored" data-anchor-id="references"><span class="header-section-number">9</span> References</h2>
<ul>
<li><p>Hastie, T., Tibshirani, R., &amp; Friedman, J. (2009). <em>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</em> (2nd ed.). Springer.<br>
<a href="https://doi.org/10.1007/978-0-387-84858-7" class="uri">https://doi.org/10.1007/978-0-387-84858-7</a></p></li>
<li><p>Kuhn, M., &amp; Johnson, K. (2013). <em>Applied Predictive Modeling</em>. Springer.<br>
<a href="https://doi.org/10.1007/978-1-4614-6849-3" class="uri">https://doi.org/10.1007/978-1-4614-6849-3</a></p></li>
<li><p>Kuhn, M., &amp; Silge, J. (2022). <em>Tidy Modeling with R</em>. O’Reilly Media.<br>
<a href="https://www.tmwr.org/" class="uri">https://www.tmwr.org/</a></p></li>
<li><p>Scikit-learn documentation. (n.d.). <em>Common pitfalls in machine learning</em>.<br>
<a href="https://scikit-learn.org/stable/common_pitfalls.html" class="uri">https://scikit-learn.org/stable/common_pitfalls.html</a></p></li>
<li><p>Inside Airbnb. (n.d.). <em>Get the data</em>.<br>
<a href="https://insideairbnb.com/get-the-data/" class="uri">https://insideairbnb.com/get-the-data/</a></p></li>
</ul>



</section>

 ]]></description>
  <category>Data Leakage</category>
  <category>R Programming</category>
  <category>Data Preprocessing</category>
  <category>Model Evaluation</category>
  <category>Statistical Learning</category>
  <category>Machine Learning Pitfalls</category>
  <category>Reproducible Research</category>
  <guid>https://mfatihtuzen.github.io/posts/2026-01-22_data_leakage/</guid>
  <pubDate>Thu, 22 Jan 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Data Normalization in R: When, Why, and How to Scale Your Data Correctly</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2026-01-02_normalization/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-01-02_normalization/normalization.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="491"></p>
</figure>
</div>
<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>This article is part of a broader series on <strong>data preprocessing in R</strong>. In earlier posts, we focused on two problems that quietly ruin analyses long before modeling begins: <strong>missing data</strong> and <strong>outliers</strong>. Both topics shared a common theme: preprocessing choices are not cosmetic; they change what the model is allowed to learn. In this installment, we move to the next decision point in the same pipeline: <strong>normalization (scaling)</strong>—often treated as “just a quick step,” but in practice a decisive modeling choice.</p>
<blockquote class="blockquote">
<p><strong>Related posts in this preprocessing series</strong></p>
<ul>
<li><p><em>Handling Missing Data in R: A Comprehensive Guide</em><br>
<a href="https://medium.com/r-evolution/handling-missing-data-in-r-a-comprehensive-guide-eca195eaead3" class="uri">https://medium.com/r-evolution/handling-missing-data-in-r-a-comprehensive-guide-eca195eaead3</a></p></li>
<li><p><em>Outliers in Data Analysis: Detecting Extreme Values Before Modeling in R</em><br>
<a href="https://medium.com/r-evolution/outliers-in-data-analysis-detecting-extreme-values-before-modeling-in-r-with-i%CC%87stanbul-airbnb-data-3b37e9ee989e" class="uri">https://medium.com/r-evolution/outliers-in-data-analysis-detecting-extreme-values-before-modeling-in-r-with-i%CC%87stanbul-airbnb-data-3b37e9ee989e</a></p></li>
</ul>
</blockquote>
<p>Normalization (or, more broadly, scaling) is frequently presented as a minor technical adjustment: something to apply quickly and forget. That framing is misleading. When the same dataset is processed with different scaling strategies, distances, similarity measures, penalty terms, and optimization paths all change. As a result, the nearest neighbors selected by KNN, the clusters formed by K-means, the principal components identified by PCA, and even the coefficients chosen by Ridge or Lasso regression can differ. Scaling does not merely “prepare” the data; it actively shapes how a model interprets importance and structure.</p>
<p>More importantly, scaling is not universally beneficial. Applied in the wrong context, it can degrade model performance or—worse—introduce subtle forms of <strong>data leakage</strong> that contaminate evaluation. A common example is learning scaling parameters (such as means and standard deviations) from the entire dataset before splitting into training and test sets. This procedure allows information from the test distribution to leak into the training process, producing performance estimates that cannot be trusted. In such cases, the issue is not the scaling method itself, but <strong>when and how</strong> it is applied. Knowing how to call <code>scale()</code> in R is trivial; understanding what to scale, when to scale it, and why is not.</p>
<p>In this article, normalization is treated as an integral part of the modeling strategy rather than a routine preprocessing step. We will address, step by step, the following questions:</p>
<ul>
<li>Why is normalization necessary?</li>
<li>Should it always be applied?</li>
<li>At what stage should it be performed—before or after the train–test split?</li>
<li>Which scaling methods are commonly used, and in which contexts do they make sense?</li>
<li>Should different data types be treated differently?</li>
<li>Is scaling appropriate for all variables, including the target variable?</li>
</ul>
<p>By combining conceptual discussion with practical R implementations, this guide aims to provide clear and principled answers to each of these questions.</p>
</section>
<section id="normalization-vs.-standardization-clearing-up-the-terminology" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="normalization-vs.-standardization-clearing-up-the-terminology"><span class="header-section-number">2</span> Normalization vs.&nbsp;Standardization: Clearing Up the Terminology</h2>
<p>In both academic writing and everyday practice, the terms <em>normalization</em> and <em>standardization</em> are frequently used interchangeably. This loose usage is one of the main sources of confusion in data preprocessing. In reality, these terms refer to <strong>different scaling strategies</strong>, each with distinct assumptions, effects, and use cases. Before discussing when and how scaling should be applied, it is therefore essential to clarify what is actually meant by each approach.</p>
<p><strong>Standardization</strong>, often referred to as <em>z-score scaling</em>, rescales a variable so that it has a mean of zero and a standard deviation of one. Formally, each observation is transformed by subtracting the sample mean and dividing by the sample standard deviation. In the R ecosystem, this logic is implemented by base R’s <code>scale()</code> and by preprocessing tools such as <code>step_normalize()</code> from the <strong>recipes</strong> package; note that, despite its name, <code>step_normalize()</code> performs z-score standardization, not min–max scaling. Standardization preserves the shape of the original distribution while putting variables on a comparable scale. It is particularly useful for models that are sensitive to the relative magnitude of predictors, such as linear models with regularization, support vector machines, and neural networks.</p>
<p><strong>Normalization</strong>, in a stricter sense, often refers to <em>min–max scaling</em>. This approach rescales variables to lie within a fixed interval, most commonly [0,1]. Each value is transformed based on the minimum and maximum observed in the training data. Min–max scaling is easy to interpret and is frequently used in algorithms where bounded inputs are desirable. However, it is also more sensitive to extreme values, since a single outlier can heavily influence the scaling range.</p>
<p>A third commonly used approach is <strong>robust scaling</strong>, which relies on the median and the interquartile range (IQR) instead of the mean and standard deviation. By construction, this method is less affected by outliers and heavy-tailed distributions. Robust scaling is especially useful in real-world datasets where extreme values are not errors but genuine observations. At the same time, it is not a universal solution; in some data structures, robust measures may become unstable or uninformative.</p>
<p>The reason terminology becomes blurred in practice is simple: many practitioners use the word <em>normalization</em> as a generic label for “any kind of scaling.” As a result, two people may both say they normalized their data while having applied entirely different transformations. Throughout this article, we will avoid this ambiguity by explicitly stating which scaling method is used and why. This distinction is not pedantic—it is essential for understanding how scaling choices influence model behavior.</p>
</section>
<section id="why-is-normalization-necessary" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="why-is-normalization-necessary"><span class="header-section-number">3</span> Why Is Normalization Necessary?</h2>
<p>The necessity of normalization becomes clear once we recognize that many modeling techniques do not operate on raw variable values directly, but on <strong>relationships derived from them</strong>—such as distances, similarities, penalties, or variance directions. When predictors are measured on different scales, these derived quantities can be dominated by variables with larger numerical ranges, regardless of their substantive importance. In such cases, the model does not learn from the data structure itself, but from arbitrary measurement units.</p>
<p>This issue is most apparent in <strong>distance-based methods</strong> such as k-nearest neighbors (KNN) and K-means clustering. These algorithms rely explicitly on distance calculations, typically Euclidean distance. If one variable ranges between 0 and 1 while another ranges between 0 and 10,000, the latter will dominate the distance computation almost entirely. As a result, proximity is determined not by overall similarity but by the scale of a single variable. Normalization ensures that each predictor contributes to the distance metric in a balanced and interpretable way, allowing the algorithm to reflect genuine similarity rather than numerical magnitude.</p>
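<p>A minimal sketch of this effect in base R, using a hypothetical four-row data set (the variable names <code>share</code> and <code>income</code> and all values are invented for illustration). It checks which observation is closest to <code>A</code> before and after z-score standardization.</p>

```r
# Hypothetical toy data: a proportion in [0, 1] and an income-like variable
obs <- data.frame(
  share  = c(0.10, 0.90, 0.12, 0.50),
  income = c(5000, 5010, 5100, 9000)
)
rownames(obs) <- c("A", "B", "C", "D")

# Euclidean distances on the raw scale: income differences dominate
d_raw <- as.matrix(dist(obs))

# Euclidean distances after standardizing both columns
d_std <- as.matrix(dist(scale(obs)))

others <- c("B", "C", "D")
nn_raw <- names(which.min(d_raw["A", others]))  # nearest neighbor of A, raw scale
nn_std <- names(which.min(d_std["A", others]))  # nearest neighbor of A, standardized

c(raw = nn_raw, standardized = nn_std)
```

<p>On the raw scale, <code>B</code> is the nearest neighbor of <code>A</code> because their incomes differ by only 10 units, even though their shares are far apart. After standardization, <code>C</code> becomes the nearest neighbor, since it is close to <code>A</code> on both variables.</p>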
<p>Normalization is equally critical in models that incorporate <strong>regularization</strong>, such as Ridge and Lasso regression. In these models, coefficients are penalized to control model complexity. However, the penalty term is directly tied to the scale of the predictors. If variables are not on comparable scales, the regularization mechanism will shrink coefficients unevenly, effectively penalizing some predictors more than others for reasons unrelated to their predictive relevance. Scaling aligns the predictors so that regularization operates as intended: as a constraint on model complexity rather than an artifact of measurement units.</p>
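<p>The dependence of the penalty on measurement units can be illustrated with a bare closed-form ridge estimator. This is a sketch under stated assumptions: <code>ridge_fit()</code> is a hypothetical helper written for this example, the data are simulated, and no internal standardization is performed (some packages, such as glmnet through its <code>standardize</code> argument, handle this step themselves).</p>

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- x1 + x2 + rnorm(n)

# Hypothetical helper: closed-form ridge on centered data, no standardization
ridge_fit <- function(X, y, lambda) {
  Xc   <- scale(X, center = TRUE, scale = FALSE)
  yc   <- y - mean(y)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc))
  drop(Xc %*% beta) + mean(y)   # fitted values
}

X_raw   <- cbind(x1, x2)
X_units <- cbind(x1, x2 / 1000)  # same information, x2 expressed in larger units

# Largest change in fitted values caused purely by the change of units
diff_ols   <- max(abs(ridge_fit(X_raw, y, 0)  - ridge_fit(X_units, y, 0)))
diff_ridge <- max(abs(ridge_fit(X_raw, y, 10) - ridge_fit(X_units, y, 10)))

c(ols = diff_ols, ridge = diff_ridge)
```

<p>With <code>lambda = 0</code> (ordinary least squares) the fitted values are unaffected by the change of units, apart from floating-point error. Under the ridge penalty they change noticeably: the coefficient on the rescaled predictor would have to be 1000 times larger to express the same effect, so it is penalized far more heavily.</p>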
<p>Other widely used techniques—including <strong>support vector machines (SVMs)</strong>, <strong>neural networks</strong>, and <strong>principal component analysis (PCA)</strong>—are also highly sensitive to scaling. In SVMs and neural networks, optimization procedures depend on gradients that are influenced by feature magnitudes, affecting both convergence speed and stability. In PCA, the directions of maximum variance are determined by the scale of the variables; without normalization, components may simply reflect variables with the largest variances rather than the most informative underlying structure. In all these cases, scaling is not an optional refinement but a prerequisite for meaningful model behavior.</p>
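<p>For PCA, the effect can be sketched with <code>prcomp()</code> from base R’s <strong>stats</strong> package. The three simulated variables below (<code>height_m</code>, <code>income_tl</code>, <code>score</code>) are invented for illustration and are generated independently of one another.</p>

```r
set.seed(123)
n <- 200
dat <- data.frame(
  height_m  = rnorm(n, mean = 1.7,   sd = 0.1),
  income_tl = rnorm(n, mean = 50000, sd = 15000),
  score     = rnorm(n, mean = 60,    sd = 10)
)

pca_raw <- prcomp(dat, center = TRUE, scale. = FALSE)  # covariance-based PCA
pca_std <- prcomp(dat, center = TRUE, scale. = TRUE)   # correlation-based PCA

# Absolute loadings of the first principal component
round(abs(pca_raw$rotation[, "PC1"]), 3)
round(abs(pca_std$rotation[, "PC1"]), 3)

# Share of total variance attributed to PC1 without scaling
summary(pca_raw)$importance["Proportion of Variance", "PC1"]
```

<p>Without scaling, the first component is essentially the income variable, simply because its variance is many orders of magnitude larger than that of the other two. With <code>scale. = TRUE</code>, the loadings are no longer dictated by measurement units.</p>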
<p>By contrast, <strong>tree-based models</strong> such as decision trees, random forests, and gradient boosting machines are generally invariant to monotonic transformations of individual predictors. Since splits are based on ordering rather than distance or magnitude, scaling is often unnecessary for these methods. Nevertheless, this does not imply that normalization is universally irrelevant in tree-based pipelines. Hybrid workflows—where tree-based models are combined with distance-based components, rule-based similarity measures, or downstream models sensitive to scale—may still require careful consideration of scaling choices. The key point is not that normalization should always be applied, but that it should be applied <strong>with respect to the assumptions of the modeling approach</strong>.</p>
<p>From a broader perspective, normalization plays a central role in modern predictive modeling workflows. As emphasized in the predictive modeling literature, preprocessing steps are not independent of the model; they are part of the modeling strategy itself. Scaling decisions shape how information is represented and, ultimately, how learning takes place. Understanding <em>why</em> normalization is necessary is therefore a prerequisite for deciding <em>when</em> and <em>how</em> it should be applied—a topic we address next.</p>
</section>
<section id="should-normalization-always-be-applied" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="should-normalization-always-be-applied"><span class="header-section-number">4</span> Should Normalization Always Be Applied?</h2>
<p>A natural question at this point is whether normalization should be applied by default in every modeling task. The short answer is <strong>no</strong>. Normalization is not a universally beneficial preprocessing step; its usefulness depends on the assumptions and internal mechanics of the chosen model. Applying scaling blindly can be as problematic as ignoring it altogether. What is needed is a <strong>decision framework</strong> that links model characteristics to preprocessing choices.</p>
<p>For a large class of models, normalization is <strong>strongly recommended</strong>. This group includes distance-based methods such as k-nearest neighbors (KNN) and K-means clustering, as well as techniques like principal component analysis (PCA), support vector machines (SVMs), neural networks, and penalized regression models (Ridge, Lasso, Elastic Net). In all these cases, either distances, inner products, variance directions, or penalty terms play a central role. Without scaling, these mechanisms are dominated by variables with larger numerical ranges, leading to distorted learning behavior. For such models, normalization is not a refinement but a prerequisite for meaningful results.</p>
<p>By contrast, normalization is <strong>generally unnecessary</strong> for tree-based models such as decision trees, random forests, and gradient boosting machines (e.g., XGBoost, GBM). These models rely on recursive binary splits based on variable ordering rather than on distances or magnitudes. Since monotonic transformations do not affect the relative ordering of values, scaling typically has no impact on model performance. As a result, normalization is often omitted in purely tree-based pipelines without any loss of effectiveness.</p>
<p>Between these two extremes lies a set of models for which normalization is <strong>context-dependent</strong>. Ordinary linear regression, for example, does not require scaling for estimation itself, but normalization may still be useful for numerical stability, interpretability of coefficients, or comparability across predictors. Similarly, Naive Bayes models may or may not benefit from scaling depending on the assumed feature distributions and the types of variables involved. In these cases, the decision to normalize should be guided by the modeling objective rather than by a fixed rule.</p>
<p>The key takeaway is that normalization should be applied <strong>with respect to the model’s assumptions</strong>, not as a default preprocessing habit. To make this decision explicit, the table below summarizes common modeling approaches and whether normalization is typically recommended.</p>
<section id="when-is-normalization-needed-a-model-based-decision-table" class="level3" data-number="4.1">
<h3 data-number="4.1" class="anchored" data-anchor-id="when-is-normalization-needed-a-model-based-decision-table"><span class="header-section-number">4.1</span> When Is Normalization Needed? A Model-Based Decision Table</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 26%">
<col style="width: 26%">
<col style="width: 47%">
</colgroup>
<thead>
<tr class="header">
<th>Model / Method</th>
<th>Is Normalization Recommended?</th>
<th>Rationale</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>KNN</td>
<td>Yes</td>
<td>Distance calculations are scale-sensitive</td>
</tr>
<tr class="even">
<td>K-means</td>
<td>Yes</td>
<td>Cluster assignment depends on distances</td>
</tr>
<tr class="odd">
<td>PCA</td>
<td>Yes</td>
<td>Variance directions dominated by scale</td>
</tr>
<tr class="even">
<td>SVM</td>
<td>Yes</td>
<td>Optimization and margins depend on feature magnitude</td>
</tr>
<tr class="odd">
<td>Neural Networks</td>
<td>Yes</td>
<td>Gradient-based optimization is scale-sensitive</td>
</tr>
<tr class="even">
<td>Ridge / Lasso / Elastic Net</td>
<td>Yes</td>
<td>Penalty terms depend on predictor scale</td>
</tr>
<tr class="odd">
<td>Linear Regression (OLS)</td>
<td>Depends</td>
<td>Not required for estimation, but useful for stability and interpretation</td>
</tr>
<tr class="even">
<td>Naive Bayes</td>
<td>Depends</td>
<td>Depends on feature types and distributional assumptions</td>
</tr>
<tr class="odd">
<td>Decision Trees</td>
<td>No</td>
<td>Split rules depend on ordering, not scale</td>
</tr>
<tr class="even">
<td>Random Forest / GBM / XGBoost</td>
<td>No</td>
<td>Tree-based structure is scale-invariant</td>
</tr>
</tbody>
</table>
</section>
</section>
<section id="when-should-normalization-be-applied-before-or-after-the-traintest-split" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="when-should-normalization-be-applied-before-or-after-the-traintest-split"><span class="header-section-number">5</span> When Should Normalization Be Applied? Before or After the Train–Test Split?</h2>
<p>This is the most critical question in the entire preprocessing workflow—and the point at which many otherwise sound analyses quietly go wrong. The issue is not <em>whether</em> normalization should be applied, but <strong>when</strong> it should be applied. At the center of this question lies a fundamental concept in predictive modeling: <strong>data leakage</strong>.</p>
<p>Data leakage occurs when information from outside the training set is used, directly or indirectly, during model training. In the context of normalization, leakage typically arises when scaling parameters—such as means and standard deviations (for standardization) or minimum and maximum values (for min–max scaling)—are estimated using the full dataset before splitting into training and test sets. Although this may appear harmless, it allows information from the test set to influence the preprocessing step, leading to overly optimistic performance estimates.</p>
<p>The correct principle is straightforward but non-negotiable:<br>
<strong>scaling parameters must be learned exclusively from the training data</strong>.<br>
Once learned, the <em>same transformation</em>—with fixed parameters—must be applied to the test set and to any future, unseen data. This ensures that the test set truly represents new information and that model evaluation reflects genuine generalization rather than procedural artifacts.</p>
<p>This principle is central to modern modeling frameworks. In the <strong>tidymodels/recipes</strong> philosophy, preprocessing steps are <em>trained</em> on the training data and then <em>applied</em> consistently to all other datasets. Similarly, in the <strong>caret</strong> framework, preprocessing transformations are estimated from the training set and reused when predicting on new data. In both cases, preprocessing is treated as part of the model training process—not as an independent, preliminary operation.</p>
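<p>As a sketch of the train-then-apply pattern in <strong>recipes</strong>, assuming the package is installed: the example uses the built-in <code>mtcars</code> data, and the 24/8 split and the choice of predictors are arbitrary choices made for illustration.</p>

```r
library(recipes)

set.seed(42)
idx   <- sample(nrow(mtcars), 24)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# prep() estimates the means and standard deviations from the training set only
rec <- recipe(mpg ~ disp + hp + wt, data = train) |>
  step_normalize(all_numeric_predictors()) |>
  prep(training = train)

train_scaled <- bake(rec, new_data = NULL)  # processed training data
test_scaled  <- bake(rec, new_data = test)  # same parameters applied to test data

colMeans(train_scaled[, c("disp", "hp", "wt")])  # zero up to rounding error
colMeans(test_scaled[, c("disp", "hp", "wt")])   # generally not zero
```

<p>The test-set means are not expected to be zero, because the parameters were never estimated from those rows.</p>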
<p>To see why this distinction matters, consider the following conceptual comparison.</p>
<section id="an-illustrative-example-scaling-before-vs.-after-the-split" class="level3" data-number="5.1">
<h3 data-number="5.1" class="anchored" data-anchor-id="an-illustrative-example-scaling-before-vs.-after-the-split"><span class="header-section-number">5.1</span> An Illustrative Example: Scaling Before vs.&nbsp;After the Split</h3>
<p>Suppose we have a dataset that we intend to split into training and test sets. We want to standardize a numeric predictor using z-score scaling.</p>
<p><strong>Incorrect approach (scaling before the split):</strong></p>
<ol type="1">
<li><p>Compute the mean and standard deviation using the <em>entire dataset</em>.</p></li>
<li><p>Standardize all observations using these global parameters.</p></li>
<li><p>Split the scaled data into training and test sets.</p></li>
<li><p>Train and evaluate the model.</p></li>
</ol>
<p>At first glance, this workflow seems efficient. However, the scaling parameters already incorporate information from the test set. The test data are no longer independent of the training process, even though they were not explicitly used to fit the model.</p>
<p><strong>Correct approach (scaling after the split):</strong></p>
<ol type="1">
<li><p>Split the raw data into training and test sets.</p></li>
<li><p>Compute scaling parameters (mean, standard deviation, etc.) <em>using only the training set</em>.</p></li>
<li><p>Apply the learned transformation to the training set.</p></li>
<li><p>Apply the <em>same</em> transformation to the test set.</p></li>
<li><p>Train the model on the scaled training data and evaluate it on the scaled test data.</p></li>
</ol>
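<p>The correct sequence above can be sketched in base R as follows. The predictor is simulated and the 45/15 split is an arbitrary choice made for illustration.</p>

```r
set.seed(2026)
x <- rexp(60, rate = 1 / 50)   # hypothetical skewed predictor

# Step 1: split the raw data
test_idx <- sample(seq_along(x), size = 15)
x_train  <- x[-test_idx]
x_test   <- x[test_idx]

# Step 2: learn the scaling parameters from the training set only
mu_train <- mean(x_train)
sd_train <- sd(x_train)

# Steps 3-4: apply the same fixed transformation to both sets
z_train <- (x_train - mu_train) / sd_train
z_test  <- (x_test  - mu_train) / sd_train

# For comparison: the leaky parameters computed on all 60 observations
c(mu_train = mu_train, mu_all = mean(x))
c(sd_train = sd_train, sd_all = sd(x))

# The scaled test set is not forced to mean 0 and sd 1
c(mean = mean(z_test), sd = sd(z_test))
```

<p>The gap between the training-only parameters and the full-data parameters is exactly the information that leaks when scaling is done before the split.</p>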
<p>In practice, these two approaches can lead to noticeably different evaluation results. Models trained using the incorrect workflow often appear to perform better on the test set—not because they generalize better, but because the preprocessing step has already “seen” the test data. This difference is especially pronounced in smaller datasets, in datasets with strong distributional differences between training and test splits, or when extreme values are present.</p>
<p>The takeaway is unambiguous:</p>
<blockquote class="blockquote">
<p><strong>Split the data first.<br>
Fit preprocessing steps on the training data.<br>
Apply the same transformations to the training and test sets.</strong></p>
</blockquote>
<p>Any deviation from this sequence undermines the validity of model evaluation, regardless of how sophisticated the modeling technique may be.</p>
</section>
</section>
<section id="common-normalization-methods-and-when-to-use-them" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="common-normalization-methods-and-when-to-use-them"><span class="header-section-number">6</span> Common Normalization Methods and When to Use Them</h2>
<p>Normalization is not a single technique but a family of transformations, each designed to address a specific modeling concern. Choosing an appropriate method requires understanding <strong>what problem the transformation is solving</strong> and <strong>which assumptions it implicitly makes</strong>. In this section, we review the most commonly used scaling approaches, discuss their strengths and limitations, and clarify when each method is appropriate.</p>
<section id="z-score-standardization" class="level3" data-number="6.1">
<h3 data-number="6.1" class="anchored" data-anchor-id="z-score-standardization"><span class="header-section-number">6.1</span> Z-score Standardization</h3>
<p>Z-score standardization rescales a variable so that it has a mean of zero and a standard deviation of one. Each observation <img src="https://latex.codecogs.com/png.latex?x_i"> is transformed as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Az_i%20=%20%5Cfrac%7Bx_i%20-%20%5Cmu%7D%7B%5Csigma%7D,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmu"> denotes the sample mean and <img src="https://latex.codecogs.com/png.latex?%5Csigma"> the sample standard deviation, both estimated <strong>from the training data only</strong>.</p>
<p><strong>Advantages.</strong><br>
Z-score standardization places variables on a comparable scale while preserving the shape of their original distributions. It is particularly suitable for models that rely on inner products, gradient-based optimization, or regularization (e.g., penalized linear models, SVMs, neural networks).</p>
<p><strong>Limitations.</strong><br>
A widespread misconception is that standardization assumes normally distributed data. This is incorrect. Z-score scaling does <strong>not</strong> require normality; it uses only the mean and standard deviation of the variable. However, it is sensitive to extreme values: large outliers can inflate <img src="https://latex.codecogs.com/png.latex?%5Csigma">, which compresses the standardized values of the remaining observations toward one another and makes them harder to distinguish.</p>
<p><strong>When to use.</strong><br>
A strong default choice when predictors differ substantially in scale and when outliers are either absent or have already been treated.</p>
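<p>A small sketch of the outlier sensitivity described above, using invented values. It compares the spread of eight ordinary observations after standardization, with and without one extreme value in the data.</p>

```r
x_clean   <- c(10, 12, 11, 13, 12, 14, 11, 13)
x_outlier <- c(x_clean, 200)   # one extreme value added

c(sd_clean = sd(x_clean), sd_outlier = sd(x_outlier))

z_clean   <- as.numeric(scale(x_clean))
z_outlier <- as.numeric(scale(x_outlier))

# Range of the eight ordinary values on the standardized scale
range(z_clean)
range(z_outlier[1:8])
```

<p>Once the outlier is included, the eight ordinary observations are squeezed into a very narrow band on the standardized scale.</p>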
<hr>
</section>
<section id="minmax-range-scaling" class="level3" data-number="6.2">
<h3 data-number="6.2" class="anchored" data-anchor-id="minmax-range-scaling"><span class="header-section-number">6.2</span> Min–Max (Range) Scaling</h3>
<p>Min–max scaling rescales variables to a fixed interval, most commonly <img src="https://latex.codecogs.com/png.latex?%5B0,%201%5D">. The transformation is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_i%5E%7B*%7D%20=%20%5Cfrac%7Bx_i%20-%20%5Cmin(x)%7D%7B%5Cmax(x)%20-%20%5Cmin(x)%7D.%0A"></p>
<p><strong>Advantages.</strong><br>
Intuitive and ensures all transformed values lie within a predefined range. Often used when bounded inputs are desirable (e.g., some neural network settings).</p>
<p><strong>Limitations.</strong><br>
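Highly sensitive to extreme values: a single outlier can stretch the range and compress most observations into a narrow band. Also, when the transformation is applied to test or future data, transformed values may fall outside <img src="https://latex.codecogs.com/png.latex?%5B0,1%5D"> if the raw values lie below the training-set minimum or above the training-set maximum. This is expected behavior rather than an error, but the deployment pipeline needs an explicit rule for it, such as clipping to the interval or flagging the observation for review.</p>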
<p><strong>When to use.</strong><br>
When input bounds are meaningful and the training data represent the likely range of future observations.</p>
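<p>A minimal base R sketch of min–max scaling with parameters learned from the training data. The helper <code>fit_minmax()</code> and its <code>clamp</code> argument are hypothetical names introduced for this example, and clipping is only one possible way to treat out-of-range values.</p>

```r
# Hypothetical helper: learn min/max from training data, return a reusable scaler
fit_minmax <- function(x_train) {
  lo <- min(x_train, na.rm = TRUE)
  hi <- max(x_train, na.rm = TRUE)
  stopifnot(hi > lo)   # a constant variable cannot be min-max scaled
  function(x, clamp = FALSE) {
    out <- (x - lo) / (hi - lo)
    if (clamp) out <- pmin(pmax(out, 0), 1)   # optional clipping to [0, 1]
    out
  }
}

x_train <- c(20, 35, 50, 65, 80)
x_new   <- c(10, 50, 95)   # includes values outside the training range

scale_mm <- fit_minmax(x_train)

scale_mm(x_train)               # 0.00 0.25 0.50 0.75 1.00
scale_mm(x_new)                 # roughly -0.167, 0.5, 1.25
scale_mm(x_new, clamp = TRUE)   # 0.0 0.5 1.0
```

<p>Returning a closure keeps the training-set minimum and maximum fixed, so the same parameters are reused for every later call.</p>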
<hr>
</section>
<section id="robust-scaling-median-and-iqr" class="level3" data-number="6.3">
<h3 data-number="6.3" class="anchored" data-anchor-id="robust-scaling-median-and-iqr"><span class="header-section-number">6.3</span> Robust Scaling (Median and IQR)</h3>
<p>Robust scaling replaces mean and standard deviation with the median and the interquartile range (IQR). The transformation is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_i%5E%7B*%7D%20=%20%5Cfrac%7Bx_i%20-%20%5Cmathrm%7Bmedian%7D(x)%7D%7B%5Cmathrm%7BIQR%7D(x)%7D,%0A"></p>
<p>where:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathrm%7BIQR%7D(x)%20=%20Q_%7B0.75%7D%20-%20Q_%7B0.25%7D.%0A"></p>
<p><strong>Advantages.</strong><br>
Less affected by extreme values and heavy-tailed distributions; useful when outliers are meaningful rather than errors.</p>
<p><strong>Limitations.</strong><br>
Not universally stable. In highly concentrated variables, <img src="https://latex.codecogs.com/png.latex?%5Cmathrm%7BIQR%7D(x)"> (or related robust measures such as MAD) may be zero or extremely small, making the transformation unstable or undefined. This must be checked explicitly.</p>
<p><strong>When to use.</strong><br>
When outliers are present and structurally inherent, and you want scaling that is less sensitive to extremes.</p>
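<p>A sketch of robust scaling in base R that includes the explicit zero-IQR check mentioned above. <code>fit_robust()</code> is a hypothetical helper and the data are invented; how to react to a zero IQR (stop, fall back to another method, or skip the variable) is a design decision left open here.</p>

```r
# Hypothetical helper: learn median and IQR from training data
fit_robust <- function(x_train) {
  med <- median(x_train, na.rm = TRUE)
  iqr <- IQR(x_train, na.rm = TRUE)
  if (!is.finite(iqr) || iqr < .Machine$double.eps) {
    stop("IQR is zero or undefined; robust scaling is not usable for this variable")
  }
  function(x) (x - med) / iqr
}

x_train  <- c(12, 15, 14, 13, 16, 15, 14, 400)   # one genuine extreme value
scale_rb <- fit_robust(x_train)
round(scale_rb(x_train), 2)

# A highly concentrated variable: more than 75% of the values are identical
x_concentrated <- c(rep(0, 9), 25)
res <- tryCatch(fit_robust(x_concentrated),
                error = function(e) conditionMessage(e))
res
```

<p>In the first case the extreme value stays extreme but does not distort the scale of the ordinary observations. In the second case the IQR is zero and the helper refuses to build a scaler.</p>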
<hr>
</section>
<section id="power-transformations-combined-with-scaling-boxcox-and-yeojohnson" class="level3" data-number="6.4">
<h3 data-number="6.4" class="anchored" data-anchor-id="power-transformations-combined-with-scaling-boxcox-and-yeojohnson"><span class="header-section-number">6.4</span> Power Transformations Combined with Scaling (Box–Cox and Yeo–Johnson)</h3>
<p>Power transformations aim to stabilize variance and reduce skewness before scaling.</p>
<p>The <strong>Box–Cox transformation</strong> (for strictly positive data) is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_i%5E%7B(%5Clambda)%7D%20=%0A%5Cbegin%7Bcases%7D%0A%5Cfrac%7Bx_i%5E%7B%5Clambda%7D%20-%201%7D%7B%5Clambda%7D,%20&amp;%20%5Clambda%20%5Cneq%200,%20%5C%5C%5C%5C%0A%5Clog(x_i),%20&amp;%20%5Clambda%20=%200.%0A%5Cend%7Bcases%7D%0A"></p>
<p>The <strong>Yeo–Johnson transformation</strong> (allows zero and negative values) is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ax_i%5E%7B(%5Clambda)%7D%20=%0A%5Cbegin%7Bcases%7D%0A%5Cfrac%7B(x_i%20+%201)%5E%7B%5Clambda%7D%20-%201%7D%7B%5Clambda%7D,%20&amp;%20x_i%20%5Cge%200,%5C%20%5Clambda%20%5Cneq%200,%20%5C%5C%5C%5C%0A%5Clog(x_i%20+%201),%20&amp;%20x_i%20%5Cge%200,%5C%20%5Clambda%20=%200,%20%5C%5C%5C%5C%0A-%5Cfrac%7B(-x_i%20+%201)%5E%7B2%20-%20%5Clambda%7D%20-%201%7D%7B2%20-%20%5Clambda%7D,%20&amp;%20x_i%20%3C%200,%5C%20%5Clambda%20%5Cneq%202,%20%5C%5C%5C%5C%0A-%5Clog(-x_i%20+%201),%20&amp;%20x_i%20%3C%200,%5C%20%5Clambda%20=%202.%0A%5Cend%7Bcases%7D%0A"></p>
<p><strong>Why combine with scaling?</strong><br>
Power transformations modify distributional shape but do not put variables on a common scale. After applying Box–Cox or Yeo–Johnson, variables are typically centered and scaled.</p>
<p><strong>Order matters.</strong><br>
A practical default sequence is: <strong>power transformation → centering → scaling</strong>. Scaling before addressing skewness can weaken the effect of the transformation and complicate interpretation; centering first would also produce negative values, which Box–Cox cannot handle.</p>
<p><strong>When to use.</strong><br>
When strong skewness or heteroscedasticity is present and when model assumptions or optimization benefit from more symmetric distributions.</p>
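<p>The Yeo–Johnson formula and the recommended order of operations can be sketched in base R as below. This is a direct transcription of the piecewise definition given earlier, not a production implementation: it assumes no missing values, the data are simulated, and <code>lambda</code> is fixed at 0 purely for illustration. In practice the parameter is estimated from the training data, for example by <code>step_YeoJohnson()</code> in <strong>recipes</strong>.</p>

```r
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # Non-negative branch
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(x[pos])
  }
  # Negative branch
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-x[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log1p(-x[!pos])
  }
  out
}

set.seed(7)
x_train <- rexp(80, rate = 0.2) - 1   # skewed, includes some negative values
lambda  <- 0                          # assumption: fixed by hand for this sketch

# Order: power transformation, then centering, then scaling
t_train <- yeo_johnson(x_train, lambda)
mu      <- mean(t_train)
s       <- sd(t_train)
z_train <- (t_train - mu) / s

# New data reuse lambda, mu and s learned from the training set
x_new <- c(-0.5, 0, 3, 20)
z_new <- (yeo_johnson(x_new, lambda) - mu) / s
z_new
```

<p>All three quantities (<code>lambda</code>, <code>mu</code>, <code>s</code>) are treated as training-set parameters and reused unchanged for new data.</p>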
<hr>
</section>
<section id="choosing-a-method-no-single-best-answer" class="level3" data-number="6.5">
<h3 data-number="6.5" class="anchored" data-anchor-id="choosing-a-method-no-single-best-answer"><span class="header-section-number">6.5</span> Choosing a Method: No Single Best Answer</h3>
<p>There is no universally optimal normalization method. Each approach reflects a trade-off between robustness, interpretability, and sensitivity to data characteristics. The appropriate choice depends on the model, the data structure, and the modeling objective.</p>
<blockquote class="blockquote">
<p>The relevant question is not <em>“Which normalization method is best?”</em><br>
but <em>“Which transformation aligns with my data and my model’s assumptions?”</em></p>
</blockquote>
</section>
</section>
<section id="do-different-data-types-require-different-scaling-strategies" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="do-different-data-types-require-different-scaling-strategies"><span class="header-section-number">7</span> Do Different Data Types Require Different Scaling Strategies?</h2>
<p>Normalization decisions should never be made independently of data types. Different variable types carry different semantic meanings, and applying the same scaling strategy indiscriminately can lead to misleading representations or unnecessary transformations. A principled preprocessing workflow therefore begins by distinguishing between variable types and understanding how each interacts with scaling.</p>
<section id="continuous-numeric-variables" class="level3" data-number="7.1">
<h3 data-number="7.1" class="anchored" data-anchor-id="continuous-numeric-variables"><span class="header-section-number">7.1</span> Continuous Numeric Variables</h3>
<p>Continuous numeric variables are the primary candidates for normalization. When such variables are measured on different scales—such as income in thousands and proportions between 0 and 1—scaling is often essential for models that rely on distances, gradients, or regularization. Z-score standardization, min–max scaling, or robust scaling are all reasonable options, depending on the presence of outliers and the modeling objective.</p>
<p>In practice, most normalization methods are designed with continuous variables in mind, and applying them here rarely raises conceptual concerns. The main decision revolves around <em>which</em> scaling method is most appropriate, not <em>whether</em> scaling should be applied at all.</p>
<hr>
</section>
<section id="count-and-ordinal-numeric-variables" class="level3" data-number="7.2">
<h3 data-number="7.2" class="anchored" data-anchor-id="count-and-ordinal-numeric-variables"><span class="header-section-number">7.2</span> Count and Ordinal Numeric Variables</h3>
<p>Some variables are stored as numeric but conceptually represent counts or ordered categories. Examples include the number of visits, rankings, Likert-scale responses, or discrete event counts. Treating such variables as purely continuous can be problematic, especially when their distributions are highly skewed or bounded at zero.</p>
<p>In these cases, applying a logarithmic or power transformation before scaling is often more appropriate than direct normalization. Power transformations can reduce skewness and stabilize variance, after which standardization or robust scaling may be applied. The key point is that <strong>the meaning of the variable matters</strong>: a difference of one unit in a count variable does not necessarily carry the same interpretation across its range.</p>
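<p>A short sketch for a count variable, using invented visit counts. It applies <code>log1p()</code> (which handles zeros) before standardizing with training-set parameters; the choice of a log transformation is an assumption made for this example.</p>

```r
visits_train <- c(0, 1, 0, 2, 3, 1, 0, 5, 12, 40)   # hypothetical counts
visits_test  <- c(0, 4, 60)

# The same one-unit step means very different things across the range
step_low  <- log1p(1)  - log1p(0)    # going from 0 to 1 visit
step_high <- log1p(41) - log1p(40)   # going from 40 to 41 visits
c(step_low = step_low, step_high = step_high)

# Transform first, then standardize with training-set parameters
log_train <- log1p(visits_train)
mu <- mean(log_train)
s  <- sd(log_train)

z_train <- (log_train - mu) / s
z_test  <- (log1p(visits_test) - mu) / s
round(z_test, 2)
```

<p>On the log scale, the step from 0 to 1 visit is far larger than the step from 40 to 41, which matches the intuition that the first visit carries more information than the forty-first.</p>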
<hr>
</section>
<section id="categorical-variables-factors-or-characters" class="level3" data-number="7.3">
<h3 data-number="7.3" class="anchored" data-anchor-id="categorical-variables-factors-or-characters"><span class="header-section-number">7.3</span> Categorical Variables (Factors or Characters)</h3>
<p>Categorical variables should <strong>never</strong> be scaled directly. Their values represent qualitative categories rather than numerical magnitudes, and applying normalization to raw category codes is meaningless.</p>
<p>When categorical variables are included in models that require numeric inputs, they must first be transformed using an encoding scheme such as one-hot (dummy) encoding. After encoding, the question of scaling arises again. In many cases, scaling encoded variables is unnecessary. However, in penalized regression models or distance-based methods, normalization of one-hot encoded variables may be beneficial to ensure that categorical and continuous predictors are treated on comparable scales.</p>
<p>The important distinction is that scaling applies <strong>after encoding</strong>, not before, and only when the model’s assumptions justify it.</p>
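<p>The ordering "encode first, then optionally scale" can be sketched in base R as follows. The data frame <code>toy</code> and its columns are hypothetical, and <code>model.matrix()</code> stands in for whichever encoding step a given pipeline uses.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy data: one categorical and one continuous predictor
toy &lt;- data.frame(
  type = factor(c("a", "b", "a", "c", "b", "c")),
  size = c(100, 250, 120, 400, 260, 380)
)

# Not meaningful: scaling the arbitrary integer codes behind the factor
# scale(as.integer(toy$type))

# Step 1: encode the factor as 0/1 dummy columns (intercept column dropped,
# so the first level "a" acts as the reference category)
dummies &lt;- model.matrix(~ type, data = toy)[, -1, drop = FALSE]

# Step 2 (optional, model-dependent): scale encoded and continuous columns
X &lt;- cbind(dummies, size = toy$size)
X_scaled &lt;- scale(X)

colnames(X_scaled)
round(colMeans(X_scaled), 10)</code></pre></div>
</div>
<p>Whether Step 2 is worthwhile depends on the model: it is typically considered for penalized or distance-based methods and skipped for an ordinary linear model.</p>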
<hr>
</section>
<section id="binary-variables-01-indicators" class="level3" data-number="7.4">
<h3 data-number="7.4" class="anchored" data-anchor-id="binary-variables-01-indicators"><span class="header-section-number">7.4</span> Binary Variables (0/1 Indicators)</h3>
<p>Binary variables occupy a special position. Since they already lie on a fixed and interpretable scale, normalization is usually unnecessary and may even obscure interpretation. For many models, leaving binary indicators unchanged is the most transparent choice.</p>
<p>That said, binary variables often enter preprocessing pipelines automatically when a rule such as “scale all numeric predictors” is applied. In such cases, standardization will transform a 0/1 variable into values centered around zero with unit variance. While this does not usually harm model performance, it changes the interpretation of coefficients and can complicate downstream analysis.</p>
<p>This highlights an important practical lesson: automated preprocessing pipelines should be used with care. Even when a transformation is mathematically valid, it may not be conceptually desirable for all variable types.</p>
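<p>A small base R sketch makes the change in interpretation concrete. The vectors <code>flag</code> and <code>y</code> are made-up values used only for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical 0/1 indicator and response
flag &lt;- c(0, 0, 0, 1, 0, 0, 1, 0)
y    &lt;- c(10, 11, 9, 15, 10, 12, 16, 11)

# Standardizing maps 0 and 1 to two non-intuitive values
z &lt;- as.numeric(scale(flag))
unique(round(z, 3))

# Coefficient on the raw indicator: difference in mean y between groups
b_raw &lt;- coef(lm(y ~ flag))[["flag"]]

# Coefficient on the standardized indicator: effect per one SD of the flag
b_std &lt;- coef(lm(y ~ z))[["z"]]

# The two differ by exactly a factor of sd(flag)
all.equal(b_std, b_raw * sd(flag))</code></pre></div>
</div>
<p>The fit is identical in both cases, but the raw coefficient reads directly as a group difference, while the standardized one has to be rescaled by <code>sd(flag)</code> before it can be interpreted that way.</p>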
<hr>
</section>
<section id="summary-scaling-depends-on-variable-meaning" class="level3" data-number="7.5">
<h3 data-number="7.5" class="anchored" data-anchor-id="summary-scaling-depends-on-variable-meaning"><span class="header-section-number">7.5</span> Summary: Scaling Depends on Variable Meaning</h3>
<p>The decision to normalize should always be guided by the <em>semantic role</em> of a variable, not merely by its storage type. Continuous measurements, counts, ordered responses, categorical indicators, and binary flags interact with scaling in fundamentally different ways. Effective preprocessing therefore requires more than applying a generic rule—it requires aligning transformations with the structure and meaning of the data.</p>
</section>
</section>
<section id="should-all-variables-be-scaled" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="should-all-variables-be-scaled"><span class="header-section-number">8</span> Should All Variables Be Scaled?</h2>
<p>A common mistake in preprocessing workflows is to treat normalization as a blanket operation applied to every variable in the dataset. In reality, <strong>not all variables should be scaled</strong>, and doing so indiscriminately can reduce interpretability or even introduce unintended distortions. Scaling decisions must therefore be made at the variable level, guided by both statistical and semantic considerations.</p>
<section id="the-target-variable-y" class="level3" data-number="8.1">
<h3 data-number="8.1" class="anchored" data-anchor-id="the-target-variable-y"><span class="header-section-number">8.1</span> The Target Variable (y)</h3>
<p>In most predictive modeling tasks, the target variable should <strong>not</strong> be normalized. Scaling the response does not improve model estimation and often complicates interpretation, particularly in regression settings where coefficients and predictions are expected to be expressed in the original units.</p>
<p>There are, however, notable exceptions. In neural network regression or other optimization-heavy models, scaling the target variable can improve numerical stability and convergence behavior. In such cases, predictions must be transformed back to the original scale before evaluation and interpretation. Outside these specific contexts, leaving the target variable unchanged remains the standard and preferred practice.</p>
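<p>When the target is scaled, the back-transformation step can be sketched as below. The simulated data are hypothetical, and <code>lm()</code> is used only as a simple stand-in for a model that would actually benefit from a scaled target (a linear model does not need it).</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)

# Hypothetical training data and a few new observations
x_train &lt;- runif(50, 0, 10)
y_train &lt;- 50000 + 20000 * x_train + rnorm(50, sd = 5000)
x_test  &lt;- c(2, 5, 8)

# Learn the target's scaling parameters from the training data only
y_mean &lt;- mean(y_train)
y_sd   &lt;- sd(y_train)
train_df &lt;- data.frame(x = x_train, y_scaled = (y_train - y_mean) / y_sd)

# Fit on the scaled target
fit &lt;- lm(y_scaled ~ x, data = train_df)

# Predict on the scaled axis, then map back to the original units
pred_scaled   &lt;- predict(fit, newdata = data.frame(x = x_test))
pred_original &lt;- pred_scaled * y_sd + y_mean

pred_original</code></pre></div>
</div>
<p>Evaluation metrics such as RMSE should be computed on <code>pred_original</code>, not on <code>pred_scaled</code>, so that errors are expressed in the units of the response.</p>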
<hr>
</section>
<section id="predictor-variables" class="level3" data-number="8.2">
<h3 data-number="8.2" class="anchored" data-anchor-id="predictor-variables"><span class="header-section-number">8.2</span> Predictor Variables</h3>
<p>For predictor variables, scaling should be applied selectively rather than universally.</p>
<section id="numeric-predictors-only" class="level4" data-number="8.2.1">
<h4 data-number="8.2.1" class="anchored" data-anchor-id="numeric-predictors-only"><span class="header-section-number">8.2.1</span> Numeric Predictors Only</h4>
<p>Normalization is meaningful only for numeric predictors. Applying scaling to non-numeric variables—either directly or implicitly through arbitrary numeric coding—has no conceptual justification. As discussed earlier, categorical variables must first be encoded, and even then, scaling is optional and model-dependent.</p>
</section>
<section id="excluding-non-informative-numeric-variables" class="level4" data-number="8.2.2">
<h4 data-number="8.2.2" class="anchored" data-anchor-id="excluding-non-informative-numeric-variables"><span class="header-section-number">8.2.2</span> Excluding Non-informative Numeric Variables</h4>
<p>Not all numeric variables carry meaningful quantitative information. Identifier variables such as IDs, account numbers, or arbitrary codes may be stored as numeric values but do not represent magnitudes or distances. Scaling such variables is meaningless, and including them in a model is potentially harmful because it introduces artificial structure where none exists. These variables should be excluded from the modeling process altogether, not merely from scaling.</p>
</section>
<section id="handling-low-variance-predictors" class="level4" data-number="8.2.3">
<h4 data-number="8.2.3" class="anchored" data-anchor-id="handling-low-variance-predictors"><span class="header-section-number">8.2.3</span> Handling Low-Variance Predictors</h4>
<p>Variables with extremely low or zero variance provide little to no information for modeling. Scaling such predictors does not solve the underlying problem: for a near-constant predictor it merely rescales noise, and for a constant predictor the standard deviation is zero, so standardization is undefined. In practice, low-variance and zero-variance predictors should be identified and removed <strong>before</strong> normalization.</p>
<p>Many preprocessing frameworks formalize this step. For example, zero-variance and near-zero-variance filters (in the recipes package, <code>step_zv()</code> and <code>step_nzv()</code>) ensure that only informative predictors enter the scaling stage. This not only improves computational efficiency but also reduces the risk of numerical instability in downstream models.</p>
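<p>A minimal sketch of filtering before scaling with the recipes package might look like this. The data frame <code>toy</code> and its column names are hypothetical, and <code>step_nzv()</code> is left at its default thresholds, which may need adjusting for real data.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(tidymodels)

set.seed(1)

# Hypothetical data: one constant, one near-constant, one informative predictor
toy &lt;- data.frame(
  y       = rnorm(100),
  x_const = rep(5, 100),         # zero variance
  x_rare  = c(rep(0, 99), 1),    # a single non-zero value
  x_ok    = rnorm(100)
)

rec &lt;- recipe(y ~ ., data = toy) %&gt;%
  step_zv(all_predictors()) %&gt;%               # drop constant columns
  step_nzv(all_predictors()) %&gt;%              # drop near-constant columns
  step_normalize(all_numeric_predictors())    # scale what remains

baked &lt;- rec %&gt;%
  prep(training = toy) %&gt;%
  bake(new_data = NULL)

names(baked)</code></pre></div>
</div>
<p>Placing the filters ahead of <code>step_normalize()</code> means the scaling step never sees a column whose standard deviation is zero.</p>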
<hr>
</section>
</section>
<section id="a-practical-rule-of-thumb" class="level3" data-number="8.3">
<h3 data-number="8.3" class="anchored" data-anchor-id="a-practical-rule-of-thumb"><span class="header-section-number">8.3</span> A Practical Rule of Thumb</h3>
<p>A disciplined preprocessing workflow follows a clear sequence:</p>
<ol type="1">
<li>Identify and remove non-informative variables (IDs, constants, near-constants).</li>
<li>Select numeric predictors that represent meaningful quantities.</li>
<li>Apply appropriate scaling only to this subset.</li>
<li>Leave the target variable unscaled, unless there is a compelling model-specific reason to do otherwise.</li>
</ol>
<p>Scaling is most effective when it is <strong>deliberate and selective</strong>, not automatic. Treating normalization as a universal operation may simplify code, but it rarely leads to better models.</p>
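<p>The four steps above can be sketched as a single recipe. This is an illustration under stated assumptions rather than the pipeline used later in this post: the tibble <code>toy</code> and all of its columns are invented for the example.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(tidymodels)

set.seed(1)

# Hypothetical data with an ID, a constant, a factor, and one real measurement
toy &lt;- tibble(
  id    = 1:100,
  price = rnorm(100, mean = 200000, sd = 50000),
  area  = rnorm(100, mean = 1500, sd = 400),
  const = 1,
  type  = factor(sample(c("a", "b"), 100, replace = TRUE))
)

rec &lt;- recipe(price ~ ., data = toy) %&gt;%
  step_rm(id) %&gt;%                             # 1. remove the identifier
  step_zv(all_predictors()) %&gt;%               # 1. remove constant columns
  step_normalize(all_numeric_predictors())    # 2-3. scale numeric predictors only

# 4. The outcome (price) is not selected by all_numeric_predictors(),
#    so it stays in its original units.
baked &lt;- rec %&gt;%
  prep(training = toy) %&gt;%
  bake(new_data = NULL)

names(baked)</code></pre></div>
</div>
<p>Using the <code>all_numeric_predictors()</code> selector instead of <code>all_numeric()</code> is what keeps the target out of the scaling step; the factor <code>type</code> is likewise left untouched.</p>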
</section>
</section>
<section id="application-plan-in-r-data-and-modeling-scenario" class="level2" data-number="9">
<h2 data-number="9" class="anchored" data-anchor-id="application-plan-in-r-data-and-modeling-scenario"><span class="header-section-number">9</span> Application Plan in R: Data and Modeling Scenario</h2>
<p>To demonstrate the practical implications of normalization decisions, we use the <strong>Ames Housing</strong> dataset, a well-known benchmark dataset designed for predictive modeling. The dataset contains <strong>2,930 observations</strong> and a rich set of predictors describing residential properties in Ames, Iowa. These predictors span multiple data types, including continuous numeric variables, discrete counts, ordinal ratings, and categorical features. This diversity makes the dataset particularly suitable for illustrating how scaling interacts with different variable types.</p>
<p>The Ames Housing dataset is distributed within the <strong>modeldata</strong> package in the tidymodels ecosystem. It was explicitly curated for teaching and methodological demonstrations, ensuring a realistic but well-documented structure. The presence of variables measured on vastly different scales—such as living area, lot size, and quality scores—provides a natural setting for exploring the effects of normalization.</p>
<section id="modeling-objective" class="level3" data-number="9.1">
<h3 data-number="9.1" class="anchored" data-anchor-id="modeling-objective"><span class="header-section-number">9.1</span> Modeling Objective</h3>
<p>The primary goal of this application is <strong>not</strong> to optimize predictive performance, but to isolate and examine the impact of different normalization strategies. For this reason, the modeling task is intentionally kept simple. We focus on predicting the <strong>sale price of a house</strong> as a regression problem, using a fixed model specification across all experiments.</p>
<p>The model itself serves merely as a vehicle for comparison. By holding the model constant and varying only the preprocessing strategy, we can attribute differences in performance and behavior directly to scaling decisions rather than to model complexity or tuning choices.</p>
</section>
<section id="scope-and-focus" class="level3" data-number="9.2">
<h3 data-number="9.2" class="anchored" data-anchor-id="scope-and-focus"><span class="header-section-number">9.2</span> Scope and Focus</h3>
<p>Throughout the application section, the emphasis remains firmly on preprocessing:</p>
<ul>
<li>the same training–test split is used across all scenarios,</li>
<li>the same set of predictors is retained,</li>
<li>the same model structure is applied.</li>
</ul>
<p>Only the normalization strategy changes. This design allows us to answer a focused question:</p>
<blockquote class="blockquote">
<p><em>How much do scaling choices matter when everything else is kept equal?</em></p>
</blockquote>
<p>By structuring the analysis in this way, the results highlight normalization as an integral component of the modeling pipeline rather than a secondary technical detail.</p>
<hr>
</section>
<section id="transition-to-implementation" class="level3" data-number="9.3">
<h3 data-number="9.3" class="anchored" data-anchor-id="transition-to-implementation"><span class="header-section-number">9.3</span> Transition to Implementation</h3>
<p>In the next section, we move from design to execution. We begin by defining a train–test split and establishing a baseline preprocessing workflow. From there, we introduce alternative normalization strategies and compare their effects using consistent evaluation criteria.</p>
</section>
<section id="data-access-and-availability" class="level3" data-number="9.4">
<h3 data-number="9.4" class="anchored" data-anchor-id="data-access-and-availability"><span class="header-section-number">9.4</span> Data Access and Availability</h3>
<p>The Ames Housing dataset used in this application is available through the <strong>modeldata</strong> package, which is part of the tidymodels ecosystem. No external download is required. Once the package is installed, the dataset can be accessed directly within R.</p>
<p>The dataset is provided for educational and methodological purposes and is accompanied by detailed documentation. For reference, the official description is available at:</p>
<p><a href="https://modeldata.tidymodels.org/reference/ames.html" class="uri">https://modeldata.tidymodels.org/reference/ames.html</a></p>
<p>In the next section, we load the dataset directly from the package and proceed with the train–test split and preprocessing workflow.</p>
</section>
</section>
<section id="implementation-in-r-split-baseline-and-the-cost-of-doing-it-wrong" class="level2" data-number="10">
<h2 data-number="10" class="anchored" data-anchor-id="implementation-in-r-split-baseline-and-the-cost-of-doing-it-wrong"><span class="header-section-number">10</span> Implementation in R: Split, Baseline, and the Cost of Doing It Wrong</h2>
<p>In this section, we operationalize the key principle introduced earlier:</p>
<blockquote class="blockquote">
<p><strong>Split → fit preprocessing on train → apply to train/test</strong></p>
</blockquote>
<p>We use the Ames Housing dataset from the <code>modeldata</code> package (no external download required) and compare three pipelines using the <strong>same model</strong>:</p>
<ol type="1">
<li><strong>Baseline (no scaling)</strong></li>
<li><strong>Incorrect scaling (data leakage)</strong>: scaling parameters learned from the full dataset</li>
<li><strong>Correct scaling</strong>: scaling parameters learned from the training set only</li>
</ol>
<p>The goal is not to build the best possible model but to <strong>isolate the effect of scaling decisions</strong>.</p>
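<p>Before turning to the Ames data, the difference between pipelines 2 and 3 can be sketched in a few lines of base R. The vector <code>x</code> is a simulated, hypothetical predictor; only the source of the mean and standard deviation differs between the two versions.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)

# Hypothetical skewed predictor and an 80/20 split
x &lt;- rexp(100, rate = 1 / 1000)
train_idx &lt;- sample(seq_along(x), size = 80)
x_train &lt;- x[train_idx]
x_test  &lt;- x[-train_idx]

# Incorrect (leakage): parameters learned from the full dataset,
# so test rows influence the transformation
mu_leak &lt;- mean(x)
sd_leak &lt;- sd(x)
x_test_leaky &lt;- (x_test - mu_leak) / sd_leak

# Correct: parameters learned from the training set only,
# then applied unchanged to the test set
mu_train &lt;- mean(x_train)
sd_train &lt;- sd(x_train)
x_test_ok &lt;- (x_test - mu_train) / sd_train

# The two sets of parameters generally differ
c(full_data = mu_leak, train_only = mu_train)
c(full_data = sd_leak, train_only = sd_train)</code></pre></div>
</div>
<p>The numerical gap is often small, but the leaky version uses information that would not be available at prediction time, which is why it is treated as a methodological error regardless of its size.</p>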
<section id="setup-and-variable-selection" class="level3" data-number="10.1">
<h3 data-number="10.1" class="anchored" data-anchor-id="setup-and-variable-selection"><span class="header-section-number">10.1</span> Setup and Variable Selection</h3>
<p>Before defining any model, we clarify what we are modeling and why these variables are used.</p>
<p><strong>Modeling goal.</strong><br>
We treat <code>Sale_Price</code> as the target variable and build a regression model that predicts house sale prices based on a small set of numeric predictors. The purpose is not to maximize predictive accuracy, but to create a controlled environment where the effect of scaling choices is easy to observe.</p>
<p><strong>Why a small subset of predictors?</strong><br>
The Ames dataset contains many variables, including categorical and ordinal predictors. For the normalization demonstrations, we intentionally select a compact set of <strong>numeric</strong> features with clearly different measurement scales. This makes the consequences of scaling (and data leakage) more visible and easier to interpret.</p>
<p><strong>Selected variables (interpretation).</strong></p>
<ul>
<li><p><code>Sale_Price</code>: sale price of the house (response variable).</p></li>
<li><p><code>Gr_Liv_Area</code>: above-ground living area (a size-related continuous measure).</p></li>
<li><p><code>Lot_Area</code>: lot size (typically spanning a much larger numeric range than living area).</p></li>
<li><p><code>Year_Built</code>: construction year (a temporal numeric variable).</p></li>
<li><p><code>Overall_Cond</code>: overall condition rating (an ordinal-like numeric score).</p></li>
<li><p><code>Latitude</code>, <code>Longitude</code>: geographic coordinates capturing location effects.</p></li>
</ul>
<hr>
</section>
<section id="load-data-and-create-a-working-dataset" class="level3" data-number="10.2">
<h3 data-number="10.2" class="anchored" data-anchor-id="load-data-and-create-a-working-dataset"><span class="header-section-number">10.2</span> Load Data and Create a Working Dataset</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidymodels)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(modeldata)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data</span>(ames, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">package =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"modeldata"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2026</span>)</span>
<span id="cb1-7"></span>
<span id="cb1-8">ames_small <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> ames <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-9">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(</span>
<span id="cb1-10">    Sale_Price,</span>
<span id="cb1-11">    Gr_Liv_Area,</span>
<span id="cb1-12">    Lot_Area,</span>
<span id="cb1-13">    Year_Built,</span>
<span id="cb1-14">    Overall_Cond,</span>
<span id="cb1-15">    Latitude,</span>
<span id="cb1-16">    Longitude</span>
<span id="cb1-17">  )</span>
<span id="cb1-18"></span>
<span id="cb1-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Missing-value check within the selected columns</span></span>
<span id="cb1-20">ames_small <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">everything</span>(), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(.)))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-22">  tidyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">everything</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"variable"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n_missing"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 7 × 2
  variable     n_missing
  &lt;chr&gt;            &lt;int&gt;
1 Sale_Price           0
2 Gr_Liv_Area          0
3 Lot_Area             0
4 Year_Built           0
5 Overall_Cond         0
6 Latitude             0
7 Longitude            0</code></pre>
</div>
</div>
<p>This step constructs a clean working dataset (<code>ames_small</code>) and confirms that none of the selected columns contain missing values. This matters for the comparisons in the next sections: the pipelines should differ only by preprocessing choices (e.g., scaling), not by inconsistent handling of missing data.</p>
</section>
<section id="traintest-split-and-evaluation-setup" class="level3" data-number="10.3">
<h3 data-number="10.3" class="anchored" data-anchor-id="traintest-split-and-evaluation-setup"><span class="header-section-number">10.3</span> Train–Test Split and Evaluation Setup</h3>
<p>Before discussing scaling, we must establish a clean evaluation setup. The key idea is simple:</p>
<blockquote class="blockquote">
<p><strong>Split first. Then learn any preprocessing parameters from the training set only.</strong></p>
</blockquote>
<p>Without a proper train–test split, we cannot meaningfully talk about generalization, and any comparison involving normalization risks becoming misleading.</p>
<hr>
<section id="create-a-stratified-traintest-split" class="level4" data-number="10.3.1">
<h4 data-number="10.3.1" class="anchored" data-anchor-id="create-a-stratified-traintest-split"><span class="header-section-number">10.3.1</span> Create a Stratified Train–Test Split</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2026</span>)</span>
<span id="cb3-2"></span>
<span id="cb3-3">split_obj <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">initial_split</span>(ames_small, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.80</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">strata =</span> Sale_Price)</span>
<span id="cb3-4"></span>
<span id="cb3-5">train_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">training</span>(split_obj)</span>
<span id="cb3-6">test_data  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">testing</span>(split_obj)</span>
<span id="cb3-7"></span>
<span id="cb3-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(train_data)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2342</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(test_data)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 588</code></pre>
</div>
</div>
<p><strong>What this does.</strong></p>
<ul>
<li><p><code>prop = 0.80</code> assigns roughly 80% of the data to training and 20% to testing.</p></li>
<li><p><code>strata = Sale_Price</code> performs a <em>stratified</em> split based on the target variable.<br>
This reduces the risk that the test set ends up with an atypical concentration of very low or very high prices—something that can easily happen with skewed targets like house prices.</p></li>
</ul>
<p><strong>How to interpret the output.</strong></p>
<ul>
<li>If the full dataset contains 2,930 observations, you should see approximately:
<ul>
<li>training: 2,342 rows</li>
<li>test: 588 rows</li>
</ul></li>
</ul>
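<p>This corresponds closely to the intended 80/20 split; the small deviation from exactly 80% arises because stratified sampling allocates rows within price bins. The two counts also sum to 2,930, confirming that no rows were lost during the split.</p>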
</section>
<section id="sanity-check-is-the-target-distribution-similar-across-splits" class="level4" data-number="10.3.2">
<h4 data-number="10.3.2" class="anchored" data-anchor-id="sanity-check-is-the-target-distribution-similar-across-splits"><span class="header-section-number">10.3.2</span> Sanity Check: Is the Target Distribution Similar Across Splits?</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(</span>
<span id="cb8-2">  train_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">split =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"train"</span>),</span>
<span id="cb8-3">  test_data  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">split =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"test"</span>)</span>
<span id="cb8-4">) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> Sale_Price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> split)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bins =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> split, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scales =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"free_y"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_manual</span>(</span>
<span id="cb8-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">train =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#1f77b4"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">test =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff7f0e"</span>)</span>
<span id="cb8-10">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb8-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sale_Price distribution after train–test split"</span>,</span>
<span id="cb8-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sale_Price"</span>,</span>
<span id="cb8-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Count"</span>,</span>
<span id="cb8-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Data split"</span></span>
<span id="cb8-16">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb8-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>()</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-01-02_normalization/index_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>What to look for.</strong></p>
<ul>
<li><p>Both distributions should be right-skewed with a similar central mass.</p></li>
<li><p>There should be no strong imbalance where most expensive (or cheapest) homes appear in only one split.</p></li>
</ul>
<p>In the plot, the overall shapes are highly similar and the mid-range is well represented in both sets, indicating that stratification preserved the structure of the target variable across splits.</p>
</section>
<section id="optional-check-quick-summary-statistics" class="level4" data-number="10.3.3">
<h4 data-number="10.3.3" class="anchored" data-anchor-id="optional-check-quick-summary-statistics"><span class="header-section-number">10.3.3</span> Optional Check: Quick Summary Statistics</h4>
<p>The summary statistics below provide a compact numerical confirmation of what the plot shows.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">train_summary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> train_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb9-3"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">split =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"train"</span>,</span>
<span id="cb9-4"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n</span>(),</span>
<span id="cb9-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(Sale_Price),</span>
<span id="cb9-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">median =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">median</span>(Sale_Price),</span>
<span id="cb9-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(Sale_Price),</span>
<span id="cb9-8"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">min =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(Sale_Price),</span>
<span id="cb9-9"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">max =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">max</span>(Sale_Price)</span>
<span id="cb9-10">)</span>
<span id="cb9-11"></span>
<span id="cb9-12">test_summary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> test_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb9-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb9-14"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">split =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"test"</span>,</span>
<span id="cb9-15"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n</span>(),</span>
<span id="cb9-16"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(Sale_Price),</span>
<span id="cb9-17"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">median =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">median</span>(Sale_Price),</span>
<span id="cb9-18"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(Sale_Price),</span>
<span id="cb9-19"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">min =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(Sale_Price),</span>
<span id="cb9-20"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">max =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">max</span>(Sale_Price)</span>
<span id="cb9-21">)</span>
<span id="cb9-22"></span>
<span id="cb9-23"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(train_summary, test_summary)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 2 × 7
  split     n    mean median     sd   min    max
  &lt;chr&gt; &lt;int&gt;   &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt; &lt;int&gt;  &lt;int&gt;
1 train  2342 180447. 160000 79157. 12789 755000
2 test    588 182185. 160500 82784. 35311 625000</code></pre>
</div>
</div>
<p><strong>How to interpret this.</strong></p>
<ul>
<li><p>Small differences between train and test are expected.</p></li>
<li><p>Large gaps—especially in the median—may indicate an unbalanced split.</p></li>
</ul>
<p>The summaries show nearly identical means and medians (train: 180,447 / 160,000; test: 182,185 / 160,500) and similar standard deviations, which supports the conclusion that the split is well balanced. Differences in the minimum and maximum values are expected, because the extremes depend on a handful of unusually cheap or expensive homes, and they do not indicate a problematic split.</p>
<p>The train–test split is well balanced and suitable for downstream modeling. The test set can be treated as a genuine proxy for unseen data, allowing us to evaluate normalization strategies without confounding effects from an unbalanced split.</p>
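<p>The same check can also be written as a single grouped summary, which avoids repeating the <code>summarise()</code> call for each split. The sketch below is self-contained and uses two small hypothetical data frames (<code>train_toy</code> and <code>test_toy</code>) in place of the real splits; only the pattern is meant to carry over.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical stand-ins for train_data and test_data
train_toy &lt;- tibble(Sale_Price = c(120000, 160000, 185000, 240000))
test_toy  &lt;- tibble(Sale_Price = c(130000, 165000, 230000))

# Stack both splits, label each row by its source, then summarise per split
split_summary &lt;- bind_rows(train = train_toy, test = test_toy, .id = "split") %&gt;%
  group_by(split) %&gt;%
  summarise(
    n      = n(),
    mean   = mean(Sale_Price),
    median = median(Sale_Price),
    sd     = sd(Sale_Price)
  )

split_summary</code></pre></div>
<p>Because both splits go through one <code>summarise()</code> call, adding or removing a statistic only has to be done in one place.</p>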
</section>
</section>
<section id="model-specification-a-scale-sensitive-baseline" class="level3" data-number="10.4">
<h3 data-number="10.4" class="anchored" data-anchor-id="model-specification-a-scale-sensitive-baseline"><span class="header-section-number">10.4</span> Model Specification: A Scale-Sensitive Baseline</h3>
<p>Before comparing different normalization strategies, we must fix the modeling component of the pipeline. This ensures that any performance differences observed later can be attributed to preprocessing choices rather than to changes in the model itself.</p>
<p><strong>Why KNN Regression?</strong></p>
<p>We deliberately choose <strong>k-nearest neighbors (KNN) regression</strong> for this demonstration. The reason is methodological, not practical.</p>
<p>KNN is a <strong>distance-based algorithm</strong>: predictions are determined by the distances between observations in the feature space. As a result, KNN is highly sensitive to the scale of the predictors. Variables with larger numeric ranges can dominate distance calculations, even if they are not substantively more important.</p>
<p>This property makes KNN an ideal diagnostic tool for studying the effects of scaling.</p>
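<p>To see the scale sensitivity in isolation, consider a minimal base R sketch. The three homes and their values are hypothetical and chosen only to make the effect visible; <code>lot_area</code> and <code>overall_cond</code> are illustrative column names, not the Ames variables themselves.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Toy data (hypothetical values): lot area in square feet, condition on a 1-10 scale
homes &lt;- data.frame(
  lot_area     = c(8000, 8100, 15000),
  overall_cond = c(2, 9, 3)
)
rownames(homes) &lt;- c("A", "B", "C")

# Raw Euclidean distances: lot_area dominates because of its larger numeric range
raw_dist &lt;- as.matrix(dist(homes))

# Distances after z-scoring each column: both variables contribute comparably
scaled_dist &lt;- as.matrix(dist(scale(homes)))

raw_dist["A", c("B", "C")]
scaled_dist["A", c("B", "C")]</code></pre></div>
<p>In this toy example, the nearest neighbor of home A is B on the raw scale but C after standardization, even though nothing about the homes themselves has changed.</p>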
<hr>
<section id="model-specification" class="level4" data-number="10.4.1">
<h4 data-number="10.4.1" class="anchored" data-anchor-id="model-specification"><span class="header-section-number">10.4.1</span> Model Specification</h4>
<p>We define a single KNN model that will be used in all subsequent scenarios.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1">knn_spec <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nearest_neighbor</span>(</span>
<span id="cb11-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">neighbors =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>,</span>
<span id="cb11-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weight_func =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rectangular"</span></span>
<span id="cb11-4">) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set_engine</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"kknn"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb11-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set_mode</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"regression"</span>)</span></code></pre></div></div>
</div>
<p><strong>Commentary.</strong></p>
<ul>
<li><p>The number of neighbors is fixed at 15 to reduce variance while maintaining locality.</p></li>
<li><p>No hyperparameter tuning is performed, as optimization is not the goal here.</p></li>
<li><p>This model specification will remain unchanged across all preprocessing pipelines.</p></li>
</ul>
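<p>For intuition about what an unweighted (<code>"rectangular"</code>) KNN regression computes, here is a minimal base R sketch. The helper <code>knn_predict()</code> and the toy data are hypothetical and are not part of <code>kknn</code> or tidymodels; the real engine does considerably more, but the core idea is the same: average the outcomes of the <em>k</em> closest training rows.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Minimal unweighted KNN regression for a single query point (illustrative only)
knn_predict &lt;- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row
  d &lt;- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Average the outcomes of the k closest rows, all with equal weight
  nearest &lt;- order(d)[seq_len(k)]
  mean(train_y[nearest])
}

# Hypothetical training data: four rows, two predictors
train_x &lt;- matrix(c(1, 2, 3, 10, 1, 2, 3, 10), ncol = 2)
train_y &lt;- c(100, 110, 120, 500)

knn_predict(train_x, train_y, query = c(2, 2), k = 3)</code></pre></div>
<p>With <em>k</em> = 3, the distant fourth row is ignored and the prediction is the mean of the three nearby outcomes. A larger <em>k</em> averages over more rows, which lowers variance at the cost of locality.</p>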
</section>
</section>
<section id="scenario-a-baseline-no-scaling" class="level3" data-number="10.5">
<h3 data-number="10.5" class="anchored" data-anchor-id="scenario-a-baseline-no-scaling"><span class="header-section-number">10.5</span> Scenario A — Baseline: No Scaling</h3>
<p>We begin with a baseline workflow in which <strong>no scaling is applied</strong>. This provides a reference point against which all normalized pipelines will be compared.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1">rec_none <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">recipe</span>(Sale_Price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train_data)</span>
<span id="cb12-2"></span>
<span id="cb12-3">wf_none <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">workflow</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb12-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_recipe</span>(rec_none) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb12-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_model</span>(knn_spec)</span>
<span id="cb12-6"></span>
<span id="cb12-7">fit_none <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(wf_none, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train_data)</span></code></pre></div></div>
</div>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<blockquote class="blockquote">
<p><strong>Note on model engines.</strong><br>
In the tidymodels ecosystem, model specifications are defined independently of the underlying computational engines. Although we specify the KNN model via <code>nearest_neighbor()</code>, the actual implementation is provided by the <code>kknn</code> package.</p>
<p>If the package is not installed, fitting the model will fail. To proceed, install and load the required engine:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"kknn"</span>)</span>
<span id="cb13-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(kknn)</span></code></pre></div></div>
<p>This separation between model specification and engine implementation is intentional and allows tidymodels to remain modular and extensible.</p>
</blockquote>
</div>
</div>
<section id="evaluate-on-the-test-set" class="level4" data-number="10.5.1">
<h4 data-number="10.5.1" class="anchored" data-anchor-id="evaluate-on-the-test-set"><span class="header-section-number">10.5.1</span> Evaluate on the Test Set</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1">pred_none <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(fit_none, test_data) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb14-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_cols</span>(test_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(Sale_Price))</span>
<span id="cb14-3"></span>
<span id="cb14-4">metrics_none <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> yardstick<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">metrics</span>(</span>
<span id="cb14-5">pred_none,</span>
<span id="cb14-6"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">truth =</span> Sale_Price,</span>
<span id="cb14-7"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">estimate =</span> .pred</span>
<span id="cb14-8">)</span>
<span id="cb14-9"></span>
<span id="cb14-10">metrics_none</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 3 × 3
  .metric .estimator .estimate
  &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
1 rmse    standard   35643.   
2 rsq     standard       0.816
3 mae     standard   23726.   </code></pre>
</div>
</div>
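<p>For readers who want to see what the three metrics measure, the following base R sketch computes them by hand. The <code>truth</code> and <code>estimate</code> vectors are hypothetical values, not taken from the model above; the <code>rsq</code> line follows yardstick's definition of R-squared as the squared correlation between truth and estimate.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">truth    &lt;- c(200000, 150000, 320000, 180000)  # hypothetical sale prices
estimate &lt;- c(210000, 140000, 300000, 185000)  # hypothetical predictions

# Root mean squared error: penalizes large errors more heavily
rmse &lt;- sqrt(mean((truth - estimate)^2))

# Mean absolute error: average size of the error, in the outcome's units
mae &lt;- mean(abs(truth - estimate))

# R-squared as the squared correlation between truth and estimate
rsq &lt;- cor(truth, estimate)^2

c(rmse = rmse, rsq = rsq, mae = mae)</code></pre></div>
<p>Both RMSE and MAE are expressed in the units of <code>Sale_Price</code>, which is why they are so much larger than the unitless R-squared.</p>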
</section>
<section id="interpretation" class="level4" data-number="10.5.2">
<h4 data-number="10.5.2" class="anchored" data-anchor-id="interpretation"><span class="header-section-number">10.5.2</span> Interpretation</h4>
<p>These values are not “good” or “bad” in isolation; what matters is that they provide a <strong>stable reference</strong>. At this stage, the model operates on raw predictor scales. For a distance-based method like KNN, this implies:</p>
<ul>
<li><p>Predictors with larger numeric ranges (e.g., <code>Lot_Area</code>) can disproportionately influence distance calculations.</p></li>
<li><p>Smaller-range variables (e.g., ordinal-like <code>Overall_Cond</code>) may contribute less than intended.</p></li>
<li><p>The model’s behavior is therefore partially shaped by measurement units, not only by predictive structure.</p></li>
</ul>
<p>This is exactly why KNN is a useful diagnostic tool in a normalization-focused article: if scaling matters, we should see clear changes relative to this baseline once we introduce normalization.</p>
<p>Next, we introduce scaling—but <strong>incorrectly</strong> on purpose. We will apply normalization <em>before</em> the train–test split (i.e., using information from the full dataset). This creates <strong>data leakage</strong> and can lead to deceptively improved test performance.</p>
<p>After that, we will implement the correct workflow (fit scaling parameters on the training set only) and compare all scenarios side by side.</p>
</section>
</section>
<section id="scenario-b-incorrect-normalization-data-leakage" class="level3" data-number="10.6">
<h3 data-number="10.6" class="anchored" data-anchor-id="scenario-b-incorrect-normalization-data-leakage"><span class="header-section-number">10.6</span> Scenario B — Incorrect Normalization (Data Leakage)</h3>
<p>In this scenario, we intentionally apply normalization <strong>the wrong way</strong>: we learn scaling parameters from the full dataset (including what will become the test set). This contaminates the evaluation because preprocessing has already “seen” information from the test distribution.</p>
<p>The goal is not to recommend this approach, but to demonstrate how easily leakage can happen—and how it can artificially improve test metrics.</p>
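<p>The mechanism can be sketched in a few lines of base R before turning to the recipe-based version. The predictor <code>x</code> and the 80/20 split are simulated and hypothetical; the point is only that the mean and standard deviation used to transform the test rows differ depending on whether those rows were included when the parameters were estimated.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(123)
x &lt;- rexp(100, rate = 1 / 10000)            # hypothetical right-skewed predictor
train_idx &lt;- sample(seq_along(x), size = 80)  # hypothetical 80/20 split

# Leaky: parameters estimated from all rows, including the future test rows
mu_full &lt;- mean(x)
sd_full &lt;- sd(x)

# Leakage-free: parameters estimated from the training rows only
mu_train &lt;- mean(x[train_idx])
sd_train &lt;- sd(x[train_idx])

# The same test rows end up with different transformed values
test_leaky   &lt;- (x[-train_idx] - mu_full)  / sd_full
test_correct &lt;- (x[-train_idx] - mu_train) / sd_train

c(mu_full = mu_full, mu_train = mu_train, sd_full = sd_full, sd_train = sd_train)</code></pre></div>
<p>In the leaky version, every test value has already influenced <code>mu_full</code> and <code>sd_full</code>, so the test set is no longer independent of the preprocessing step.</p>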
<section id="leakage-pipeline-normalize-using-full-data" class="level4" data-number="10.6.1">
<h4 data-number="10.6.1" class="anchored" data-anchor-id="leakage-pipeline-normalize-using-full-data"><span class="header-section-number">10.6.1</span> Leakage Pipeline: Normalize Using Full Data</h4>
<p>The <code>step_normalize()</code> operation applies only to numeric predictors. In our dataset, <code>Overall_Cond</code> is stored as a factor (ordinal-like category), so it must not be normalized directly.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1">rec_leak <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">recipe</span>(Sale_Price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> ames_small) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">step_normalize</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">all_numeric_predictors</span>())</span>
<span id="cb16-3"></span>
<span id="cb16-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># WRONG on purpose: prep() estimates the scaling parameters from the full data (leakage)</span></span>
<span id="cb16-5">prep_leak <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">prep</span>(rec_leak, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">training =</span> ames_small)</span>
<span id="cb16-6"></span>
<span id="cb16-7">train_leak <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bake</span>(prep_leak, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">new_data =</span> train_data)</span>
<span id="cb16-8">test_leak  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bake</span>(prep_leak, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">new_data =</span> test_data)</span>
<span id="cb16-9"></span>
<span id="cb16-10">wf_leak <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">workflow</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_model</span>(knn_spec) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_formula</span>(Sale_Price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> .)</span>
<span id="cb16-13"></span>
<span id="cb16-14">fit_leak <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(wf_leak, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train_leak)</span>
<span id="cb16-15"></span>
<span id="cb16-16">pred_leak <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(fit_leak, test_leak) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb16-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_cols</span>(test_leak <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(Sale_Price))</span>
<span id="cb16-18"></span>
<span id="cb16-19">metrics_leak <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> yardstick<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">metrics</span>(pred_leak, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">truth =</span> Sale_Price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">estimate =</span> .pred)</span>
<span id="cb16-20">metrics_leak</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 3 × 3
  .metric .estimator .estimate
  &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
1 rmse    standard   37036.   
2 rsq     standard       0.801
3 mae     standard   24411.   </code></pre>
</div>
</div>
</section>
<section id="interpretation-1" class="level4" data-number="10.6.2">
<h4 data-number="10.6.2" class="anchored" data-anchor-id="interpretation-1"><span class="header-section-number">10.6.2</span> Interpretation</h4>
<p>The performance obtained under this scenario reflects the consequences of <strong>incorrect normalization with data leakage</strong>.</p>
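<p>Compared to the baseline (no scaling), all three metrics deteriorate. This indicates that learning normalization parameters from the full dataset does <strong>not</strong> automatically lead to better predictive performance. In this case, the leakage-induced transformation appears to distort the distance structure in a way that is unfavorable for KNN. One caveat applies: this scenario also passes pre-baked data through <code>add_formula()</code> rather than <code>add_recipe()</code>, so part of the gap may reflect that difference in how the predictors reach the model rather than leakage alone.</p>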
<p>This result is particularly instructive because it challenges a common misconception:<br>
<strong>data leakage does not necessarily inflate performance metrics</strong>. Its effect depends on the interaction between the preprocessing step, the data distribution, and the model. What leakage <em>does</em> guarantee, however, is that the evaluation is no longer valid.</p>
<p>Even if the metrics had improved under this scenario, they could not be trusted as estimates of out-of-sample performance. The test data would no longer represent genuinely unseen observations, since information from their distribution had already been incorporated during preprocessing.</p>
<p>At this point, two important conclusions can be drawn:</p>
<ol type="1">
<li><p>Scaling decisions materially affect model behavior, especially for distance-based methods.</p></li>
<li><p>The timing of scaling—<em>when</em> parameters are learned—is as critical as <em>whether</em> scaling is applied at all.</p></li>
</ol>
<p>In the next scenario, we apply normalization correctly by estimating scaling parameters using the training data only and then applying them unchanged to the test set. This will provide the only defensible estimate of generalization performance among the normalization strategies considered.</p>
</section>
</section>
<section id="scenario-c-correct-normalization-train-only-scaling" class="level3" data-number="10.7">
<h3 data-number="10.7" class="anchored" data-anchor-id="scenario-c-correct-normalization-train-only-scaling"><span class="header-section-number">10.7</span> Scenario C — Correct Normalization (Train-Only Scaling)</h3>
<p>In this final preprocessing scenario, normalization parameters are learned <strong>exclusively from the training data</strong> and then applied consistently to both the training and test sets.</p>
<p>This workflow adheres to the core principle of leakage-free modeling.</p>
<section id="correct-pipeline-normalize-using-training-data-only" class="level4" data-number="10.7.1">
<h4 data-number="10.7.1" class="anchored" data-anchor-id="correct-pipeline-normalize-using-training-data-only"><span class="header-section-number">10.7.1</span> Correct Pipeline: Normalize Using Training Data Only</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1">rec_ok <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">recipe</span>(Sale_Price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train_data) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">step_normalize</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">all_numeric_predictors</span>())</span>
<span id="cb18-3"></span>
<span id="cb18-4">wf_ok <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">workflow</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_recipe</span>(rec_ok) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_model</span>(knn_spec)</span>
<span id="cb18-7"></span>
<span id="cb18-8">fit_ok <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(wf_ok, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train_data)</span>
<span id="cb18-9"></span>
<span id="cb18-10">pred_ok <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(fit_ok, test_data) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_cols</span>(test_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(Sale_Price))</span>
<span id="cb18-12"></span>
<span id="cb18-13">metrics_ok <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> yardstick<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">metrics</span>(pred_ok, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">truth =</span> Sale_Price, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">estimate =</span> .pred)</span>
<span id="cb18-14"></span>
<span id="cb18-15">metrics_ok</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 3 × 3
  .metric .estimator .estimate
  &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
1 rmse    standard   35643.   
2 rsq     standard       0.816
3 mae     standard   23726.   </code></pre>
</div>
</div>
</section>
<section id="interpretation-2" class="level4" data-number="10.7.2">
<h4 data-number="10.7.2" class="anchored" data-anchor-id="interpretation-2"><span class="header-section-number">10.7.2</span> Interpretation</h4>
<p>This scenario represents the <strong>correct normalization workflow</strong>, where scaling parameters are learned exclusively from the training data and then applied unchanged to the test set. The results are <strong>identical to the no-scaling baseline</strong>. This finding is highly informative.</p>
<p>First, it confirms that normalization itself does not automatically improve model performance. When applied correctly, scaling does not inject additional information into the modeling process; it merely changes the representation of the data. A plausible explanation for the exact match is that the <code>kknn</code> engine can standardize numeric predictors internally (through its <code>scale</code> argument); if that option is active, an additional normalization step leaves the distances, and therefore the predictions, unchanged.</p>
<p>Second, the contrast with the leakage scenario is crucial. In Scenario B, incorrect normalization degraded performance, while in this scenario, correct normalization restores the metrics to their baseline levels. This symmetry reinforces the core message of this article:<br>
<strong>the validity of preprocessing matters more than the apparent gains it may produce.</strong></p>
<p>Third, these results highlight an often-overlooked point: the impact of scaling is model- and data-dependent. For this particular subset of predictors and this KNN configuration, normalization neither helps nor harms when applied correctly. In other settings—different feature sets, different distance metrics, or different models—the effect could be substantial.</p>
<p>The key takeaway is therefore not that scaling is unnecessary, but that it must be:</p>
<ul>
<li><p>applied deliberately,</p></li>
<li><p>restricted to appropriate variables,</p></li>
<li><p>and learned at the correct stage of the modeling workflow.</p></li>
</ul>
<p>With all three scenarios evaluated, we can now compare them side by side and distill the practical lessons they offer.</p>
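<p>One way to see why correctly applied normalization can leave a distance-based model unchanged is to check what happens when standardized data are standardized again. The base R sketch below uses simulated, hypothetical predictors and assumes that the engine standardizes numeric predictors internally (the <code>kknn</code> function exposes a <code>scale</code> argument for this; whether it is active in a given setup should be checked against the package documentation).</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(42)
# Hypothetical predictors on very different scales
X &lt;- cbind(
  lot_area   = runif(50, 2000, 20000),
  year_built = runif(50, 1900, 2010)
)

# Distances if the engine standardizes the raw predictors itself
d_raw &lt;- dist(scale(X))

# Distances if we z-score first and the engine then standardizes again
d_pre &lt;- dist(scale(scale(X)))

# Compare the two sets of pairwise distances
all.equal(as.vector(d_raw), as.vector(d_pre))</code></pre></div>
<p>Z-scoring data that already have mean 0 and standard deviation 1 changes nothing, so the two sets of distances agree up to floating-point tolerance. Under the stated assumption, this is consistent with Scenarios A and C producing the same metrics.</p>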
</section>
</section>
<section id="results-comparison" class="level3" data-number="10.8">
<h3 data-number="10.8" class="anchored" data-anchor-id="results-comparison"><span class="header-section-number">10.8</span> Results Comparison</h3>
<p>This section places the three scenarios side by side. Because the model specification and the data split were held constant, any differences observed here are attributable to how the data were preprocessed and passed to the model.</p>
<section id="performance-summary" class="level4" data-number="10.8.1">
<h4 data-number="10.8.1" class="anchored" data-anchor-id="performance-summary"><span class="header-section-number">10.8.1</span> Performance Summary</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb20-1">results_tbl <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(</span>
<span id="cb20-2">  metrics_none <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scenario =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A — No Scaling"</span>),</span>
<span id="cb20-3">  metrics_leak <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scenario =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B — Incorrect Scaling (Leakage)"</span>),</span>
<span id="cb20-4">  metrics_ok   <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scenario =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C — Correct Scaling (Train-Only)"</span>)</span>
<span id="cb20-5">) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb20-6">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(scenario, .metric, .estimate) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb20-7">  tidyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_wider</span>(</span>
<span id="cb20-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_from =</span> .metric,</span>
<span id="cb20-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_from =</span> .estimate</span>
<span id="cb20-10">  )</span>
<span id="cb20-11"></span>
<span id="cb20-12">results_tbl</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 3 × 4
  scenario                           rmse   rsq    mae
  &lt;chr&gt;                             &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt;
1 A — No Scaling                   35643. 0.816 23726.
2 B — Incorrect Scaling (Leakage)  37036. 0.801 24411.
3 C — Correct Scaling (Train-Only) 35643. 0.816 23726.</code></pre>
</div>
</div>
<p>This table summarizes test-set performance across all scenarios.</p>
<ul>
<li><p><strong>Scenario A (No Scaling)</strong> serves as the baseline.</p></li>
<li><p><strong>Scenario B (Incorrect Scaling with Leakage)</strong> shows degraded performance: RMSE rises from 35,643 to 37,036 and R<sup>2</sup> falls from 0.816 to 0.801.</p></li>
<li><p><strong>Scenario C (Correct Scaling)</strong> reproduces the baseline results exactly.</p></li>
</ul>
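<p>To express the gap in relative terms, the RMSE values can be compared against the baseline. The sketch below simply retypes the rounded values printed in the table; the vector name <code>rmse_by_scenario</code> is introduced here for illustration and is not part of the original workflow.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># RMSE values copied from the printed table above (rounded as displayed)
rmse_by_scenario &lt;- c(A = 35643, B = 37036, C = 35643)

# Percentage change in RMSE relative to Scenario A (the no-scaling baseline)
baseline &lt;- rmse_by_scenario[["A"]]
pct_change &lt;- 100 * (rmse_by_scenario - baseline) / baseline

round(pct_change, 1)</code></pre></div>
</div>
<p>On these rounded figures, the leakage scenario's RMSE is about 3.9% higher than the baseline, while Scenario C shows no change.</p>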
</section>
<section id="visual-comparison-rmse" class="level4" data-number="10.8.2">
<h4 data-number="10.8.2" class="anchored" data-anchor-id="visual-comparison-rmse"><span class="header-section-number">10.8.2</span> Visual Comparison (RMSE)</h4>
<p>To make the differences easier to interpret, we visualize RMSE across scenarios.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb22-1">results_tbl <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb22-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> scenario, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> rmse, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> scenario)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb22-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_col</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb22-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_manual</span>(</span>
<span id="cb22-5"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb22-6"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A — No Scaling"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#1f77b4"</span>,</span>
<span id="cb22-7"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B — Incorrect Scaling (Leakage)"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#d62728"</span>,</span>
<span id="cb22-8"><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C — Correct Scaling (Train-Only)"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#2ca02c"</span></span>
<span id="cb22-9">)</span>
<span id="cb22-10">) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb22-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb22-12"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RMSE comparison across preprocessing scenarios"</span>,</span>
<span id="cb22-13"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Preprocessing scenario"</span>,</span>
<span id="cb22-14"><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"RMSE"</span></span>
<span id="cb22-15">) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb22-16"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb22-17"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"none"</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2026-01-02_normalization/index_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
</section>
<section id="interpretation-3" class="level4" data-number="10.8.3">
<h4 data-number="10.8.3" class="anchored" data-anchor-id="interpretation-3"><span class="header-section-number">10.8.3</span> Interpretation</h4>
<p>Several important conclusions emerge from this comparison.</p>
<p>First, <strong>normalization does not inherently improve performance</strong>. When applied correctly (Scenario C), scaling neither improves nor degrades performance relative to the no-scaling baseline. This confirms that normalization is a representational transformation, not a source of predictive signal.</p>
<p>Second, <strong>incorrect normalization can be harmful</strong>. Scenario B demonstrates that learning scaling parameters from the full dataset can distort the feature space in ways that negatively affect model behavior. Even more importantly, this scenario yields an invalid evaluation, regardless of whether the metrics appear better or worse.</p>
<p>Third, these results reinforce a central theme of this article:<br>
<strong>the correctness of the preprocessing workflow matters more than the choice of preprocessing method itself</strong>.</p>
<p>In practice, this means that:</p>
<ul>
<li><p>scaling should be applied only when it aligns with the model’s assumptions,</p></li>
<li><p>preprocessing parameters must be learned exclusively from training data,</p></li>
<li><p>and any apparent performance gains should be scrutinized for potential leakage.</p></li>
</ul>
</section>
</section>
<section id="practical-takeaways-from-the-application" class="level3" data-number="10.9">
<h3 data-number="10.9" class="anchored" data-anchor-id="practical-takeaways-from-the-application"><span class="header-section-number">10.9</span> Practical Takeaways from the Application</h3>
<p>From this controlled experiment, we can distill three practical lessons:</p>
<ol type="1">
<li><p><strong>Do not expect normalization to be a silver bullet.</strong> Its impact depends on the model, the data, and the feature set.</p></li>
<li><p><strong>Never compromise the train–test boundary.</strong> Leakage can invalidate results even when performance does not improve.</p></li>
<li><p><strong>Treat preprocessing as part of the model.</strong> Decisions about scaling are modeling decisions, not technical afterthoughts.</p></li>
</ol>
<p>These lessons generalize beyond KNN and apply to any workflow involving scale-sensitive models and data transformations.</p>
</section>
</section>
<section id="discussion-and-conclusion" class="level2" data-number="11">
<h2 data-number="11" class="anchored" data-anchor-id="discussion-and-conclusion"><span class="header-section-number">11</span> Discussion and Conclusion</h2>
<p>Normalization is often introduced as a routine preprocessing step, applied almost reflexively before modeling. This article has argued—and demonstrated—that such a view is incomplete. Normalization is not a purely technical adjustment; it is a <strong>modeling decision</strong> whose consequences depend on the interaction between data, model assumptions, and evaluation design.</p>
<p>From a theoretical perspective, scaling matters because many learning algorithms are sensitive to the relative magnitudes of predictors. Distance-based methods, regularized models, kernel methods, and optimization-driven algorithms implicitly encode assumptions about scale. Ignoring these assumptions can distort model behavior, while respecting them can improve stability and interpretability. At the same time, scaling does not create new information. It reshapes how existing information is represented.</p>
<p>The empirical application using the Ames Housing dataset reinforced these points. By holding the model and data split constant and varying only the preprocessing strategy, we isolated the effect of normalization decisions. Three key findings emerged.</p>
<p>First, <strong>normalization does not guarantee performance improvements</strong>. In the correct workflow, scaling reproduced the baseline results exactly. This confirms that normalization should not be expected to “fix” a model by itself. Its role is conditional and context-dependent.</p>
<p>Second, <strong>incorrect normalization compromises validity</strong>. Learning scaling parameters from the full dataset—thereby introducing data leakage—altered model behavior and degraded performance in this example. More importantly, even if the metrics had improved, the evaluation would have been invalid. Leakage undermines the fundamental purpose of a test set: to approximate unseen data.</p>
<p>Third, <strong>the timing of preprocessing is as important as the method chosen</strong>. The difference between valid and invalid evaluation hinged not on whether scaling was applied, but on <em>when</em> its parameters were learned. This distinction is often overlooked in practice, yet it is central to trustworthy modeling.</p>
<p>Taken together, these results support a broader principle: preprocessing steps should be treated as integral components of the modeling pipeline, not as detached technical preliminaries. Decisions about normalization should be guided by model assumptions, data characteristics, and evaluation design—not by habit or generic checklists.</p>
<p>In practical terms, this leads to a simple but robust rule:</p>
<blockquote class="blockquote">
<p><strong>Split the data first. Learn preprocessing parameters from the training set only. Apply the same transformations to all future data.</strong></p>
</blockquote>
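<p>The rule can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the workflow used in the application above: the data are simulated, the object names (<code>toy</code>, <code>train_mean</code>, <code>train_sd</code>) are hypothetical, and base R's <code>scale()</code> stands in for a full preprocessing pipeline.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(123)

# Hypothetical toy predictors (simulated; not the Ames data)
toy &lt;- data.frame(
  area = rnorm(100, mean = 1500, sd = 400),
  age  = rnorm(100, mean = 30,   sd = 10)
)

# 1. Split the data first
train_id &lt;- sample(nrow(toy), size = 80)
train &lt;- toy[train_id, ]
test  &lt;- toy[-train_id, ]

# 2. Learn centering and scaling parameters from the training set only
train_mean &lt;- colMeans(train)
train_sd   &lt;- apply(train, 2, sd)

# 3. Apply the same parameters to both training and test data
train_scaled &lt;- scale(train, center = train_mean, scale = train_sd)
test_scaled  &lt;- scale(test,  center = train_mean, scale = train_sd)

# Training columns have mean 0 by construction;
# test columns are generally close to 0 but not exactly 0
round(colMeans(train_scaled), 3)
round(colMeans(test_scaled), 3)</code></pre></div>
</div>
<p>The leaky variant would compute the means and standard deviations from <code>toy</code> as a whole. The code would look almost identical, which is why the mistake is easy to make and hard to spot.</p>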
<p>Normalization, when used deliberately and correctly, is a powerful tool. When applied mechanically or at the wrong stage, it can mislead. Understanding this distinction is essential for building models that are not only accurate, but also scientifically defensible.</p>
<hr>
</section>
<section id="references" class="level2" data-number="12">
<h2 data-number="12" class="anchored" data-anchor-id="references"><span class="header-section-number">12</span> References</h2>
<ul>
<li><p>Hastie, T., Tibshirani, R., &amp; Friedman, J. (2009).<br>
<em>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</em>. Springer.</p></li>
<li><p>Kuhn, M., &amp; Johnson, K. (2013).<br>
<em>Applied Predictive Modeling</em>. Springer.</p></li>
<li><p>Kuhn, M., &amp; Wickham, H. (2023).<br>
<em>Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles</em>.<br>
<a href="https://www.tidymodels.org/" class="uri">https://www.tidymodels.org/</a></p></li>
<li><p>Tidymodels Recipes Documentation.<br>
<a href="https://recipes.tidymodels.org/" class="uri">https://recipes.tidymodels.org/</a></p></li>
<li><p>Kuhn, M. (Caret package documentation).<br>
<a href="https://topepo.github.io/caret/" class="uri">https://topepo.github.io/caret/</a></p></li>
<li><p>Modeldata package documentation (Ames Housing dataset).<br>
<a href="https://modeldata.tidymodels.org/reference/ames.html" class="uri">https://modeldata.tidymodels.org/reference/ames.html</a></p></li>
</ul>



</section>

 ]]></description>
  <category>Data Preprocessing</category>
  <category>R Programming</category>
  <category>Data Science</category>
  <category>Machine Learning</category>
  <guid>https://mfatihtuzen.github.io/posts/2026-01-02_normalization/</guid>
  <pubDate>Fri, 02 Jan 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Understanding Data Import and Export in R: Working with CSV and Excel Files</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2025-12-26_import_export/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-12-26_import_export/import_export.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="477"></p>
</figure>
</div>
<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>When learning R, most people focus on functions, models, and visualizations. However, many real-world problems start much earlier — at the <strong>data import stage</strong> — and end much later — with <strong>exporting results</strong>.</p>
<p>If data is read incorrectly, no statistical method can save the analysis.</p>
<p>In this post, we focus on the <strong>logic of data import and export in R</strong>, using <strong>CSV and Excel files</strong>. Rather than memorizing functions, we build a mental model for how R interacts with files.</p>
</section>
<section id="why-data-import-and-export-matters" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="why-data-import-and-export-matters"><span class="header-section-number">2</span> Why Data Import and Export Matters</h2>
<p>Data analysis is a workflow:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode mathematica code-with-copy"><code class="sourceCode mathematica"><span id="cb1-1">Data source → <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Import</span> → Analysis → Results → <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Export</span> → Sharing</span></code></pre></div></div>
<p>Errors often occur at the <em>import</em> stage:</p>
<ul>
<li><p>wrong delimiters,</p></li>
<li><p>incorrect decimal separators,</p></li>
<li><p>incorrect file paths,</p></li>
<li><p>silently converted data types.</p></li>
</ul>
<p>The result?<br>
A model that runs perfectly — on the <strong>wrong data</strong>.</p>
</section>
<section id="csv-vs-excel-not-a-competition" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="csv-vs-excel-not-a-competition"><span class="header-section-number">3</span> CSV vs Excel: Not a Competition</h2>
<p>Before touching R, we should clarify the difference between file formats.</p>
<section id="csv-files" class="level3" data-number="3.1">
<h3 data-number="3.1" class="anchored" data-anchor-id="csv-files"><span class="header-section-number">3.1</span> CSV Files</h3>
<ul>
<li><p>Plain text files</p></li>
<li><p>Lightweight and fast</p></li>
<li><p>Universally supported</p></li>
<li><p>One table per file</p></li>
<li><p>No formatting, only data</p></li>
</ul>
<p>Example:</p>
<pre><code>total_bill,tip,sex
16.99,1.01,Female</code></pre>
</section>
<section id="excel-files" class="level3" data-number="3.2">
<h3 data-number="3.2" class="anchored" data-anchor-id="excel-files"><span class="header-section-number">3.2</span> Excel Files</h3>
<ul>
<li><p>Zipped XML format (<code>.xlsx</code>), not readable as plain text</p></li>
<li><p>Can contain multiple sheets</p></li>
<li><p>Store structure and presentation together</p></li>
<li><p>Widely used for reporting and sharing</p></li>
</ul>
<p><strong>Key idea:</strong><br>
CSV is a <em>data transport format</em>.<br>
Excel is a <em>communication format</em>.</p>
</section>
</section>
<section id="working-directory-where-r-actually-looks" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="working-directory-where-r-actually-looks"><span class="header-section-number">4</span> Working Directory: Where R Actually Looks</h2>
<p>One of the most common beginner mistakes has nothing to do with R syntax.</p>
<p>R does <strong>not</strong> search your entire computer for files. A bare file name such as <code>"tips.csv"</code> is resolved relative to the <strong>working directory</strong>, and nowhere else.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getwd</span>()</span></code></pre></div></div>
</div>
<p>This command shows where R is currently looking.</p>
<p>If a file exists on your computer but not in this directory, R behaves as if the file does not exist, unless you give its full path.</p>
<p>This is why errors like:</p>
<pre class="pssql"><code>cannot open the connection</code></pre>
<p>usually indicate a <strong>path problem</strong>, not a coding problem.</p>
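<p>A quick way to diagnose a path problem is to ask R whether it can see the file at all. The sketch below is self-contained: it writes a two-line demo file to a temporary folder (the file name <code>tips_demo.csv</code> is made up for this example) and compares a relative path with a full path.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Create a small demo file in a temporary folder
demo_dir  &lt;- tempdir()
demo_path &lt;- file.path(demo_dir, "tips_demo.csv")
writeLines(c("total_bill,tip,sex", "16.99,1.01,Female"), demo_path)

getwd()                        # where relative paths are resolved
file.exists("tips_demo.csv")   # relative path: FALSE unless the file is in getwd()
file.exists(demo_path)         # full path: TRUE

# Reading with the full path works regardless of the working directory
tips_demo &lt;- read.csv(demo_path)
tips_demo</code></pre></div>
</div>
<p>Checking <code>file.exists()</code> before importing separates "R cannot find the file" from "R found the file but could not parse it".</p>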
</section>
<section id="the-example-dataset-tips" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="the-example-dataset-tips"><span class="header-section-number">5</span> The Example Dataset: <code>tips</code></h2>
<p>Throughout this post, we use a single dataset: <strong>tips</strong>.</p>
<ul>
<li><p>Restaurant tipping data</p></li>
<li><p>Small and easy to understand</p></li>
<li><p>Contains numeric and categorical variables</p></li>
<li><p>Ideal for demonstrating import/export logic</p></li>
</ul>
<p>Data source:<br>
<a href="https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv" class="uri">https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv</a></p>
</section>
<section id="reading-csv-files-the-core-logic" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="reading-csv-files-the-core-logic"><span class="header-section-number">6</span> Reading CSV Files: The Core Logic</h2>
<p>When R reads a CSV file, it needs answers to four questions:</p>
<ol type="1">
<li><p>How are columns separated?</p></li>
<li><p>Is the first row a header?</p></li>
<li><p>What is the decimal separator?</p></li>
<li><p>How should text be interpreted?</p></li>
</ol>
<p>These answers are provided via <strong>function arguments</strong>.</p>
</section>
<section id="read.table-the-foundation" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="read.table-the-foundation"><span class="header-section-number">7</span> <code>read.table()</code>: The Foundation</h2>
<p>All CSV-reading functions in base R are built on <code>read.table()</code>.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">tips <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.table</span>(</span>
<span id="cb5-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">file =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips.csv"</span>,</span>
<span id="cb5-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">header =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>,</span>
<span id="cb5-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">","</span>,</span>
<span id="cb5-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dec =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"."</span>,</span>
<span id="cb5-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stringsAsFactors =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb5-7">)</span></code></pre></div></div>
</div>
<p>Understanding this function means understanding CSV import in R.</p>
</section>
<section id="read.csv-and-its-assumptions" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="read.csv-and-its-assumptions"><span class="header-section-number">8</span> <code>read.csv()</code> and Its Assumptions</h2>
<p><code>read.csv()</code> is simply a shortcut for a common case:</p>
<ul>
<li><p>Columns separated by commas</p></li>
<li><p>Decimal separator is a dot</p></li>
<li><p>First row is a header</p></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">tips <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips.csv"</span>)</span></code></pre></div></div>
</div>
<p>This works perfectly — <strong>if the assumptions match the file</strong>.</p>
<p>The dangerous part? R may not throw an error even if the assumptions are wrong.</p>
<blockquote class="blockquote">
<p>The most dangerous errors are silent ones.</p>
</blockquote>
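<p>A small, self-contained illustration of such a silent error follows. The data are typed inline through the <code>text</code> argument, so no file is needed; the object names <code>semi_lines</code>, <code>wrong</code>, and <code>right</code> are hypothetical.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># A semicolon-separated table with dot decimals, supplied inline
semi_lines &lt;- c(
  "total_bill;tip;sex",
  "16.99;1.01;Female",
  "10.34;1.66;Male"
)

# Wrong assumption: read.csv() expects commas, finds none, and raises no error
wrong &lt;- read.csv(text = semi_lines)
ncol(wrong)   # a single column holding each whole line as text

# Matching the separator to the file gives the intended three columns
right &lt;- read.csv(text = semi_lines, sep = ";")
str(right)</code></pre></div>
</div>
<p>Both calls complete without a warning. Only a check such as <code>ncol()</code> or <code>str()</code> reveals that the first result is unusable.</p>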
</section>
<section id="read.csv2-and-regional-differences" class="level2" data-number="9">
<h2 data-number="9" class="anchored" data-anchor-id="read.csv2-and-regional-differences"><span class="header-section-number">9</span> <code>read.csv2()</code> and Regional Differences</h2>
<p>In many European datasets:</p>
<ul>
<li><p>Columns are separated by semicolons</p></li>
<li><p>Decimals use commas</p></li>
</ul>
<pre><code>total_bill;tip;sex
16,99;1,01;Female</code></pre>
<p><code>read.csv2()</code> is designed for exactly this structure.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">tips2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.csv2</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips_semicolon.csv"</span>)</span></code></pre></div></div>
</div>
<p>Important nuance:<br>
If the decimals in a semicolon-separated file use dots, <code>read.csv2()</code> may still run without an error, but correct numeric parsing is <strong>not guaranteed</strong>.</p>
<p>Correct approach:</p>
<blockquote class="blockquote">
<p>Always inspect the file structure before choosing the function.</p>
</blockquote>
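<p>Inspecting a file can be as simple as printing its first raw lines before parsing anything. The sketch below assumes a small European-style file written to a temporary location; the names <code>eu_path</code> and <code>tips_eu</code> are illustrative only.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Write a small semicolon-separated file with decimal commas
eu_path &lt;- tempfile(fileext = ".csv")
writeLines(
  c("total_bill;tip;sex", "16,99;1,01;Female", "10,34;1,66;Male"),
  eu_path
)

# Step 1: look at the raw text to see the separator and the decimal mark
readLines(eu_path, n = 2)

# Step 2: semicolons and decimal commas match the defaults of read.csv2()
tips_eu &lt;- read.csv2(eu_path)

# Step 3: confirm that numbers were parsed as numbers
sapply(tips_eu, class)</code></pre></div>
</div>
<p>If a column that should be numeric shows up as <code>character</code> in the last step, the separator or decimal settings probably do not match the file.</p>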
</section>
<section id="writing-csv-files-from-r" class="level2" data-number="10">
<h2 data-number="10" class="anchored" data-anchor-id="writing-csv-files-from-r"><span class="header-section-number">10</span> Writing CSV Files from R</h2>
<p>Data analysis rarely ends in R. Results are shared as files.</p>
<section id="writing-comma-separated-csv" class="level3" data-number="10.1">
<h3 data-number="10.1" class="anchored" data-anchor-id="writing-comma-separated-csv"><span class="header-section-number">10.1</span> Writing comma-separated CSV</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">write.csv</span>(tips, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips_comma.csv"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">row.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span></code></pre></div></div>
</div>
</section>
<section id="writing-semicolon-separated-csv" class="level3" data-number="10.2">
<h3 data-number="10.2" class="anchored" data-anchor-id="writing-semicolon-separated-csv"><span class="header-section-number">10.2</span> Writing semicolon-separated CSV</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">write.csv2</span>(tips, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips_semicolon.csv"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">row.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span></code></pre></div></div>
</div>
<p>Choosing the correct format depends on <strong>who will read the file next</strong>.</p>
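<p>One way to check an export is a round trip: write the file, look at the raw lines, and read it back with the matching function. The following sketch uses a tiny hand-made data frame (<code>toy_tips</code>, a hypothetical stand-in for <code>tips</code>) and temporary files so that it runs anywhere.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">toy_tips &lt;- data.frame(
  total_bill = c(16.99, 10.34),
  tip        = c(1.01, 1.66),
  sex        = c("Female", "Male")
)

comma_path &lt;- tempfile(fileext = ".csv")
semi_path  &lt;- tempfile(fileext = ".csv")

write.csv(toy_tips,  comma_path, row.names = FALSE)
write.csv2(toy_tips, semi_path,  row.names = FALSE)

# Compare the raw text produced by the two writers
readLines(comma_path, n = 2)
readLines(semi_path,  n = 2)

# Read each file back with the matching reader and compare with the original
back_comma &lt;- read.csv(comma_path)
back_semi  &lt;- read.csv2(semi_path)
all.equal(toy_tips, back_comma)
all.equal(toy_tips, back_semi)</code></pre></div>
</div>
<p>Pairing <code>write.csv()</code> with <code>read.csv()</code> and <code>write.csv2()</code> with <code>read.csv2()</code> keeps the separator and decimal conventions consistent on both sides.</p>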
</section>
</section>
<section id="why-we-still-need-excel" class="level2" data-number="11">
<h2 data-number="11" class="anchored" data-anchor-id="why-we-still-need-excel"><span class="header-section-number">11</span> Why We Still Need Excel</h2>
<p>CSV is technically superior in many ways. Yet Excel remains dominant in practice.</p>
<p>Why?</p>
<ul>
<li><p>Multiple tables in one file</p></li>
<li><p>Familiar interface for non-technical users</p></li>
<li><p>Common reporting format</p></li>
</ul>
<p>Excel is not an analysis tool — but it <em>is</em> a powerful delivery tool.</p>
</section>
<section id="working-with-excel-in-r-openxlsx" class="level2" data-number="12">
<h2 data-number="12" class="anchored" data-anchor-id="working-with-excel-in-r-openxlsx"><span class="header-section-number">12</span> Working with Excel in R: <code>openxlsx</code></h2>
<p>The <code>openxlsx</code> package allows Excel operations <strong>without requiring Excel itself</strong>.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(openxlsx)</span></code></pre></div></div>
</div>
<section id="writing-a-simple-excel-file" class="level3" data-number="12.1">
<h3 data-number="12.1" class="anchored" data-anchor-id="writing-a-simple-excel-file"><span class="header-section-number">12.1</span> Writing a simple Excel file</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">write.xlsx</span>(tips, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips.xlsx"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sheetName =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips"</span>)</span></code></pre></div></div>
</div>
</section>
<section id="reading-from-excel" class="level3" data-number="12.2">
<h3 data-number="12.2" class="anchored" data-anchor-id="reading-from-excel"><span class="header-section-number">12.2</span> Reading from Excel</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1">tips_excel <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.xlsx</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips.xlsx"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sheet =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div></div>
</div>
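<p>Reading by position with <code>sheet = 1</code> works until someone reorders the workbook. A minimal sketch of a more defensive pattern is to list the sheet names with <code>getSheetNames()</code> first and then read by name. The data frame <code>tips_demo</code> and the temporary file are made up for illustration and are not the article's <code>tips</code> data.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">library(openxlsx)

# Hypothetical stand-in for the tips data, written to a temporary file
tips_demo &lt;- data.frame(day = c("Sat", "Sun"), tip = c(3.0, 2.5))
demo_path &lt;- tempfile(fileext = ".xlsx")
write.xlsx(tips_demo, demo_path, sheetName = "tips")

# Inspect which sheets exist before choosing one
sheet_names &lt;- getSheetNames(demo_path)
print(sheet_names)

# Read by name so a reordered workbook cannot silently change the result
tips_back &lt;- read.xlsx(demo_path, sheet = "tips")
str(tips_back)</code></pre>
</div>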
</section>
</section>
<section id="multiple-sheets-a-mini-report" class="level2" data-number="13">
<h2 data-number="13" class="anchored" data-anchor-id="multiple-sheets-a-mini-report"><span class="header-section-number">13</span> Multiple Sheets: A Mini Report</h2>
<p>Excel shines when organizing related tables.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1">summary_tips <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aggregate</span>(tip <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> day, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> tips, mean)</span>
<span id="cb14-2"></span>
<span id="cb14-3">wb <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">createWorkbook</span>()</span>
<span id="cb14-4"></span>
<span id="cb14-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addWorksheet</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Raw Data"</span>)</span>
<span id="cb14-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">writeData</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Raw Data"</span>, tips)</span>
<span id="cb14-7"></span>
<span id="cb14-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addWorksheet</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>)</span>
<span id="cb14-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">writeData</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>, summary_tips)</span>
<span id="cb14-10"></span>
<span id="cb14-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">saveWorkbook</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"tips_report.xlsx"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">overwrite =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span></code></pre></div></div>
</div>
<p>One file.<br>
</p>
<p>Multiple views.<br>
</p>
<p>Clean structure.</p>
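<p>For simple cases without styling, a shorter route may be enough: <code>write.xlsx()</code> also accepts a named list of data frames and writes one sheet per element. The sketch below uses toy data (<code>raw_demo</code>, <code>summary_demo</code>) and a temporary file, all of which are assumptions for illustration.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">library(openxlsx)

# Toy data standing in for tips and its summary
raw_demo &lt;- data.frame(day = c("Sat", "Sat", "Sun", "Sun"), tip = c(3, 2, 4, 5))
summary_demo &lt;- aggregate(tip ~ day, data = raw_demo, mean)

# List names become sheet names
report_path &lt;- tempfile(fileext = ".xlsx")
write.xlsx(list("Raw Data" = raw_demo, "Summary" = summary_demo), report_path)

getSheetNames(report_path)</code></pre>
</div>
<p>The explicit workbook approach shown earlier remains the better fit once you need formatting, column widths, or several objects on one sheet.</p>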
</section>
<section id="common-mistakes-to-watch-for" class="level2" data-number="14">
<h2 data-number="14" class="anchored" data-anchor-id="common-mistakes-to-watch-for"><span class="header-section-number">14</span> Common Mistakes to Watch For</h2>
<p>Most import errors are caused not by R itself, but by unchecked assumptions:</p>
<ul>
<li><p>Incorrect working directory</p></li>
<li><p>Wrong delimiter (<code>sep</code>)</p></li>
<li><p>Wrong decimal separator (<code>dec</code>)</p></li>
<li><p>Reading the wrong Excel sheet</p></li>
<li><p>Overwriting files unintentionally</p></li>
</ul>
<p>A healthy habit after every import (here <code>data</code> stands for whatever object you just read in):</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(data)</span>
<span id="cb15-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str</span>(data)</span>
<span id="cb15-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(data)</span></code></pre></div></div>
</div>
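<p>Several of the mistakes listed above leave a recognizable trace that can be checked in code. A wrong delimiter usually produces a single-column data frame, a wrong decimal separator leaves a numeric column stored as text, and an accidental overwrite can be prevented with <code>file.exists()</code>. A base R sketch on inline example text follows; the text and the helper name <code>safe_write_csv</code> are made up for illustration.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Semicolon-separated text, read with the wrong and the right delimiter
csv_text &lt;- "day;tip\nSat;3.5\nSun;2.0"
wrong_sep &lt;- read.csv(text = csv_text, sep = ",")
right_sep &lt;- read.csv(text = csv_text, sep = ";")
ncol(wrong_sep)  # a single column signals a delimiter problem
ncol(right_sep)

# Decimal commas: without dec = "," the tip column is not numeric
csv_text_dec &lt;- "day;tip\nSat;3,5\nSun;2,0"
wrong_dec &lt;- read.csv(text = csv_text_dec, sep = ";")
right_dec &lt;- read.csv(text = csv_text_dec, sep = ";", dec = ",")
is.numeric(wrong_dec$tip)
is.numeric(right_dec$tip)

# Hypothetical helper: refuse to overwrite an existing file
safe_write_csv &lt;- function(df, path) {
  if (file.exists(path)) stop("File already exists: ", path)
  write.csv(df, path, row.names = FALSE)
  invisible(path)
}</code></pre>
</div>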
</section>
<section id="final-thoughts" class="level2" data-number="15">
<h2 data-number="15" class="anchored" data-anchor-id="final-thoughts"><span class="header-section-number">15</span> Final Thoughts</h2>
<p>If you can:</p>
<ul>
<li><p>read data correctly,</p></li>
<li><p>write data consciously,</p></li>
<li><p>choose file formats intentionally,</p></li>
</ul>
<p>you have already crossed one of the most important thresholds in data analysis.</p>
<p>For a complementary discussion, you may also find this article useful:<br>
<a href="https://medium.com/p/e730f4a84b3b" class="uri">https://medium.com/p/e730f4a84b3b</a></p>
<hr>
<p><strong>Extended version on Medium:</strong><br>
<a href="https://medium.com/@Fatih.Tuzen/understanding-data-import-and-export-in-r-working-with-csv-and-excel-files-6322e61049b2">https://medium.com/@Fatih.Tuzen/understanding-data-import-and-export-in-r-working-with-csv-and-excel-files-6322e61049b2</a></p>


<!-- -->

</section>

 ]]></description>
  <category>R Programming</category>
  <category>Data Analysis</category>
  <category>Data Science</category>
  <category>CSV</category>
  <category>Excel</category>
  <category>Data Import</category>
  <category>Data Export</category>
  <guid>https://mfatihtuzen.github.io/posts/2025-12-26_import_export/</guid>
  <pubDate>Fri, 26 Dec 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Outliers in Data Analysis: Detecting Extreme Values Before Modeling in R with İstanbul Airbnb Data</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2025-12-19_outliers/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-12-19_outliers/outliers.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="484"></p>
</figure>
</div>
<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>Data preprocessing is often presented as a sequence of technical steps. However, each preprocessing decision implicitly embeds a statistical assumption.</p>
<p>In a previous article, I discussed how missing observations can bias analysis if they are ignored or handled improperly:</p>
<p><a href="https://medium.com/r-evolution/handling-missing-data-in-r-a-comprehensive-guide-eca195eaead3"><strong>Handling Missing Data in R: A Comprehensive Guide</strong></a></p>
<p>This article continues that discussion by focusing on <strong>outliers</strong>. Unlike missing values, outliers are observed data points. The challenge is not their absence, but their <em>extremeness</em>.</p>
<p>Understanding whether an extreme value is informative or misleading is a crucial step before any modeling effort.</p>
</section>
<section id="why-outliers-matter" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="why-outliers-matter"><span class="header-section-number">2</span> Why Outliers Matter</h2>
<p>Outliers can affect statistical analysis in several fundamental ways:</p>
<ul>
<li>They distort summary statistics such as the mean and standard deviation</li>
<li>They can dominate parameter estimates in regression models</li>
<li>They influence distance-based methods such as clustering</li>
</ul>
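<p>The first point is easy to check on a handful of numbers. In the sketch below the prices are made up and the helper name is arbitrary. The idea is that a single extreme value moves the mean and standard deviation a great deal, while the median and the MAD hardly react.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Made-up nightly prices, with and without one extreme listing
prices &lt;- c(1200, 1600, 2100, 2500, 2900, 3400, 4100)
prices_extreme &lt;- c(prices, 400000)

# Classical (mean, sd) versus robust (median, MAD) summaries
center_spread &lt;- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), mad = mad(x))
}

round(rbind(
  typical      = center_spread(prices),
  with_extreme = center_spread(prices_extreme)
))</code></pre>
</div>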
<p>More importantly, outliers force analysts to confront a key question:</p>
<blockquote class="blockquote">
<p>Are we observing rare but valid behavior, or a deviation from the assumed data-generating process?</p>
</blockquote>
</section>
<section id="what-is-an-outlier" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="what-is-an-outlier"><span class="header-section-number">3</span> What Is an Outlier?</h2>
<p>Informally, an outlier is an observation that appears unusually large or small relative to the rest of the data. Formally, an outlier is an observation that is inconsistent with the bulk of the data <strong>under a given statistical model</strong>. Outliers are therefore not absolute objects. They depend on assumptions about distribution, scale, and structure.</p>
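<p>This dependence on assumptions can be made concrete. The sketch below applies the common rule of 1.5 times the IQR, used here purely as an illustration, to the same made-up prices twice: once on the raw scale and once after a log transform. The helper name <code>iqr_outlier</code> is hypothetical.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Flag values beyond k * IQR from the quartiles
iqr_outlier &lt;- function(x, k = 1.5) {
  q &lt;- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence &lt;- k * (q[2] - q[1])
  x &lt; q[1] - fence | x &gt; q[2] + fence
}

# Made-up, right-skewed prices
prices &lt;- c(100, 150, 200, 250, 300, 400, 500, 700, 1000, 3000)

flag_raw &lt;- iqr_outlier(prices)       # rule applied on the raw scale
flag_log &lt;- iqr_outlier(log(prices))  # same rule on the log scale

prices[flag_raw]
prices[flag_log]</code></pre>
</div>
<p>With these numbers, 3000 is flagged on the raw scale but not on the log scale. The observation is the same; only the assumed scale has changed.</p>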
</section>
<section id="the-dataset-inside-airbnb-listings-istanbul" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="the-dataset-inside-airbnb-listings-istanbul"><span class="header-section-number">4</span> The Dataset: Inside Airbnb Listings (Istanbul)</h2>
<p>To demonstrate outlier detection methods, we will use <strong>Inside Airbnb</strong> listings data. Inside Airbnb is a mission-driven project that publishes datasets scraped from publicly available Airbnb listing pages and provides city-level downloads for research and analysis.</p>
<p>In this article, we will work with the <strong>detailed listings</strong> file:</p>
<ul>
<li><code>listings.csv.gz</code> (detailed listing-level data; typically rich and feature-complete)</li>
</ul>
<p>You can download the dataset from the <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a> “Get the Data” page (choose a city and download <em>Detailed Listings data</em>).</p>
<section id="why-this-dataset-is-ideal-for-outlier-detection" class="level3" data-number="4.1">
<h3 data-number="4.1" class="anchored" data-anchor-id="why-this-dataset-is-ideal-for-outlier-detection"><span class="header-section-number">4.1</span> Why this dataset is ideal for outlier detection</h3>
<p>Unlike many “clean” educational datasets, Airbnb listing data often contains <strong>genuinely extreme values</strong>, especially in <code>price</code>. These extremes are not necessarily errors—luxury properties exist—but they can heavily distort means, variances, and model estimates. That makes Airbnb listings a realistic and highly instructive dataset for outlier detection.</p>
</section>
<section id="variables-we-will-use" class="level3" data-number="4.2">
<h3 data-number="4.2" class="anchored" data-anchor-id="variables-we-will-use"><span class="header-section-number">4.2</span> Variables we will use</h3>
<p>Although the Airbnb listings dataset contains many variables, this article focuses on a <strong>small, purpose-driven subset</strong>.</p>
<p>Our primary variable of interest is:</p>
<ul>
<li><code>price</code> (converted to <code>price_num</code>): nightly listing price.<br>
This variable is typically right-skewed and often contains extreme values, making it ideal for illustrating outlier detection methods.</li>
</ul>
<p>To provide context for interpreting extreme prices, we also retain a limited number of supporting variables:</p>
<ul>
<li><code>minimum_nights</code>: minimum stay requirement, which can occasionally take unusually large values</li>
<li><code>number_of_reviews</code>: a proxy for listing activity and popularity, often zero-inflated</li>
<li><code>room_type</code>: categorical variable indicating the type of accommodation</li>
<li><code>neighbourhood_cleansed</code>: cleaned neighborhood label, useful for geographic context</li>
</ul>
<p>These additional variables are not used to <em>detect</em> outliers directly, but to <strong>interpret and explain</strong> them once identified.</p>
</section>
<section id="loading-the-data-in-r" class="level3" data-number="4.3">
<h3 data-number="4.3" class="anchored" data-anchor-id="loading-the-data-in-r"><span class="header-section-number">4.3</span> Loading the data in R</h3>
<p>Below is an example using <strong>Istanbul</strong>. The code reads a local copy of <code>listings.csv.gz</code> downloaded from Inside Airbnb, so if you prefer a different city, download the corresponding file and point <code>read_csv()</code> at it.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(readr)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(stringr)</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example URL (Istanbul). You can get the latest link from Inside Airbnb "Get the Data".</span></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The URL structure typically follows:</span></span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://data.insideairbnb.com/turkey/marmara/istanbul/2025-09-29/data/listings.csv.gz</span></span>
<span id="cb1-8"></span>
<span id="cb1-9">listings_raw <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(</span>
<span id="cb1-10">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"listings.csv.gz"</span>,</span>
<span id="cb1-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show_col_types =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb1-12">)</span>
<span id="cb1-13"></span>
<span id="cb1-14">vars_keep <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb1-15">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>,</span>
<span id="cb1-16">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"price"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"minimum_nights"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"number_of_reviews"</span>,</span>
<span id="cb1-17">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"room_type"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"neighbourhood_cleansed"</span></span>
<span id="cb1-18">)</span>
<span id="cb1-19"></span>
<span id="cb1-20">listings_small <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> listings_raw <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb1-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">any_of</span>(vars_keep))</span></code></pre></div></div>
</div>
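<p>If you would rather fetch the file from within R, a sketch along the following lines should work. The URL is the one quoted in the code comment above and may be out of date. The function name <code>fetch_listings</code> and the local file name are arbitrary choices, and the final call is left commented out because it needs network access.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">library(readr)

# URL pattern quoted above; check Inside Airbnb for the current snapshot date
listings_url &lt;- "https://data.insideairbnb.com/turkey/marmara/istanbul/2025-09-29/data/listings.csv.gz"

# Hypothetical helper: download once, then read the local copy
fetch_listings &lt;- function(url, dest = "listings.csv.gz") {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the gzip file intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  # read_csv() decompresses .gz files automatically
  read_csv(dest, show_col_types = FALSE)
}

# listings_raw &lt;- fetch_listings(listings_url)</code></pre>
</div>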
</section>
<section id="inspecting-the-selected-variables" class="level3" data-number="4.4">
<h3 data-number="4.4" class="anchored" data-anchor-id="inspecting-the-selected-variables"><span class="header-section-number">4.4</span> Inspecting the selected variables</h3>
<p>Before performing any transformation, it is important to inspect the data <strong>as it comes from the source</strong>. This allows us to understand variable types and identify potential issues early.</p>
<p>Below, we examine only the variables selected for this article.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glimpse</span>(listings_small)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 30,051
Columns: 7
$ id                     &lt;dbl&gt; 1.342043e+18, 1.342082e+18, 1.342211e+18, 1.342…
$ name                   &lt;chr&gt; "Отдельная квартира на Фатих(Балат).", "Blue st…
$ price                  &lt;chr&gt; "$2,290.00", "$1,101.00", "$3,430.00", "$3,178.…
$ minimum_nights         &lt;dbl&gt; 5, 7, 2, 100, 1, 100, 1, 5, 2, 100, 100, 2, 100…
$ number_of_reviews      &lt;dbl&gt; 4, 4, 26, 1, 2, 0, 41, 0, 19, 0, 0, 26, 0, 0, 1…
$ room_type              &lt;chr&gt; "Entire home/apt", "Private room", "Entire home…
$ neighbourhood_cleansed &lt;chr&gt; "Fatih", "Beyoglu", "Beyoglu", "Sisli", "Sisli"…</code></pre>
</div>
</div>
<p>At this stage, notice in particular the <code>price</code> variable. Although it represents a numerical concept (nightly price), it is not stored as a numeric variable.</p>
<p>Instead, <code>price</code> is typically read as a character string, often containing currency symbols and separators. This is common in datasets that originate from web scraping or user-facing platforms.</p>
</section>
<section id="why-we-need-to-convert-price-to-numeric" class="level3" data-number="4.5">
<h3 data-number="4.5" class="anchored" data-anchor-id="why-we-need-to-convert-price-to-numeric"><span class="header-section-number">4.5</span> Why we need to convert <code>price</code> to numeric</h3>
<p>Outlier detection methods such as boxplots, the IQR rule, and Z-scores require <strong>numeric input</strong>. As long as <code>price</code> is stored as a character variable, it cannot be used in quantitative analysis.</p>
<p>More importantly, treating <code>price</code> as numeric is not just a technical requirement. It reflects a modeling decision: we explicitly state that this variable represents a measurable quantity on which arithmetic operations are meaningful.</p>
</section>
<section id="converting-price-to-a-numeric-variable" class="level3" data-number="4.6">
<h3 data-number="4.6" class="anchored" data-anchor-id="converting-price-to-a-numeric-variable"><span class="header-section-number">4.6</span> Converting <code>price</code> to a numeric variable</h3>
<p>To prepare the data for analysis, we remove non-numeric characters and convert <code>price</code> to a numeric variable, which we call <code>price_num</code>.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">listings_small <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> listings_small <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">price_num =</span> price <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-3">           <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_replace_all</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[^0-9.]"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb4-4">           <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>())</span></code></pre></div></div>
</div>
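<p>To see what this conversion does, it helps to run it on a few values. The sketch below uses three price strings in the format shown in the <code>glimpse()</code> output plus one missing value; the vector names are made up.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">library(stringr)

# A few price strings in the source format, plus a missing value
price_chr &lt;- c("$2,290.00", "$1,101.00", "$80.00", NA)

# Direct conversion fails: "$" and "," turn every value into NA
suppressWarnings(as.numeric(price_chr))

# Keep digits and the decimal point only, then convert
price_clean &lt;- as.numeric(str_replace_all(price_chr, "[^0-9.]", ""))
price_clean

# Missing inputs stay missing instead of becoming 0
sum(is.na(price_clean))</code></pre>
</div>
<p>The pattern assumes a dot as the decimal separator and a comma as the thousands separator, as in this dataset. A file using the opposite convention would need a different pattern.</p>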
<p>After the conversion, we can verify the result by inspecting basic summaries:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(listings_small<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>price_num)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     80    1644    2538    5084    4108 4437598    4803 </code></pre>
</div>
</div>
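<p>The summary already hints at the problem: the mean (5,084) is about twice the median (2,538), and the maximum exceeds 4.4 million. It also shows that 4,803 listings have no price and remain <code>NA</code>. With that in mind, <code>price_num</code> is ready for outlier detection and visualization. In the next section, we will use this variable to illustrate how extreme values can be identified using visual tools and formal statistical rules.</p>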
</section>
</section>
<section id="visualizing-price-distributions-and-potential-outliers" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="visualizing-price-distributions-and-potential-outliers"><span class="header-section-number">5</span> Visualizing Price Distributions and Potential Outliers</h2>
<p>Before applying any formal outlier detection rule, it is good practice to explore the distribution of the variable visually. Visualization helps us understand the <em>shape</em>, <em>spread</em>, and <em>asymmetry</em> of the data, and often reveals extreme values immediately.</p>
<p>In this section, we focus on the numeric price variable <code>price_num</code>.</p>
<section id="a-first-attempt-why-the-raw-histogram-fails" class="level3" data-number="5.1">
<h3 data-number="5.1" class="anchored" data-anchor-id="a-first-attempt-why-the-raw-histogram-fails"><span class="header-section-number">5.1</span> A first attempt: why the raw histogram fails</h3>
<p>A natural first step is to plot a histogram of nightly prices on the original scale.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(scales)</span>
<span id="cb7-3"></span>
<span id="cb7-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(listings_small, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> price_num)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bins =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#B0B0B0"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">label_number</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">big.mark =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">","</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb7-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Distribution of Nightly Prices (raw scale)"</span>,</span>
<span id="cb7-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Nightly price"</span>,</span>
<span id="cb7-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Count"</span></span>
<span id="cb7-11">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-12-19_outliers/index_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>Interpretation</strong></p>
<p>This plot is technically correct, but analytically unhelpful.</p>
<ul>
<li><p>A small number of extremely expensive listings stretches the x-axis.</p></li>
<li><p>The majority of observations are compressed near zero.</p></li>
<li><p>As a result, the internal structure of the data becomes almost invisible.</p></li>
</ul>
<p>This is not a plotting mistake. It is a direct consequence of <strong>heavy right-skewness</strong>, which is common in price data. At this point, it is already clear that naive visualizations on the raw scale are insufficient.</p>
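<p>The compression can be quantified. The sketch below uses simulated log-normal prices, not the Airbnb data, plus one value equal to the maximum reported in the summary above. It then counts how many observations fall into the first of 40 equal-width bins. The bin placement only approximates what <code>geom_histogram()</code> does.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(1)

# Simulated right-skewed prices plus a single extreme value
prices_sim &lt;- c(rlnorm(1000, meanlog = log(2500), sdlog = 0.7), 4437598)

# 40 equal-width bins across the full range
breaks &lt;- seq(min(prices_sim), max(prices_sim), length.out = 41)
bin_counts &lt;- hist(prices_sim, breaks = breaks, plot = FALSE)$counts

bin_counts[1]        # observations in the first bin
sum(bin_counts[-1])  # observations in the other 39 bins</code></pre>
</div>
<p>Here a single extreme value sets the axis range, so every other observation lands in the first bin.</p>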
</section>
<section id="adding-context-prices-depend-on-room_type" class="level3" data-number="5.2">
<h3 data-number="5.2" class="anchored" data-anchor-id="adding-context-prices-depend-on-room_type"><span class="header-section-number">5.2</span> Adding context: prices depend on <code>room_type</code></h3>
<p>Airbnb listings are not drawn from a single homogeneous market. A <em>shared room</em> and an <em>entire home/apt</em> represent fundamentally different accommodation types, and their prices should not be expected to follow the same distribution.</p>
<p>If we ignore this context and search for outliers globally, we risk labeling valid group-level differences as anomalies. For this reason, we first examine how prices behave <strong>within each room type</strong>.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">listings_small <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb8-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(room_type, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sort =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 4 × 2
  room_type           n
  &lt;chr&gt;           &lt;int&gt;
1 Entire home/apt 20243
2 Private room     9494
3 Hotel room        157
4 Shared room       157</code></pre>
</div>
</div>
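<p>A toy example shows why the grouping matters. The sketch below applies the rule of 1.5 times the IQR, again only as an illustration, first to all prices at once and then within each room type. The data and the helper name <code>iqr_outlier</code> are made up.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Flag values beyond k * IQR from the quartiles
iqr_outlier &lt;- function(x, k = 1.5) {
  q &lt;- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence &lt;- k * (q[2] - q[1])
  x &lt; q[1] - fence | x &gt; q[2] + fence
}

# Made-up listings: many cheap shared rooms, a few entire homes
toy &lt;- data.frame(
  room_type = c(rep("Shared room", 9), rep("Entire home/apt", 3)),
  price_num = c(300, 320, 350, 380, 400, 420, 450, 480, 500,
                2000, 2500, 3000)
)

toy_flags &lt;- toy %&gt;%
  mutate(flag_global = iqr_outlier(price_num)) %&gt;%   # one rule for all rows
  group_by(room_type) %&gt;%
  mutate(flag_within = iqr_outlier(price_num)) %&gt;%   # rule per room type
  ungroup()

toy_flags %&gt;%
  count(room_type, flag_global, flag_within)</code></pre>
</div>
<p>In this toy example, all three entire homes are flagged by the global rule and none within their own group: the "outliers" are simply a different accommodation type.</p>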
</section>
<section id="price-distributions-by-room-type-log-scale" class="level3" data-number="5.3">
<h3 data-number="5.3" class="anchored" data-anchor-id="price-distributions-by-room-type-log-scale"><span class="header-section-number">5.3</span> Price distributions by room type (log scale)</h3>
<p>To make the right tail interpretable without discarding extreme values, we visualize prices on a logarithmic scale and separate distributions by room type.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(listings_small, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> price_num)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bins =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">35</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#4C72B0"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_log10</span>(</span>
<span id="cb10-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">log_breaks</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>),</span>
<span id="cb10-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">label_number</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">big.mark =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">","</span>)</span>
<span id="cb10-6">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> room_type, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scales =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"free_y"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb10-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Nightly Price Distributions by Room Type (log scale)"</span>,</span>
<span id="cb10-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Log scale improves readability in heavily right-skewed price data"</span>,</span>
<span id="cb10-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Nightly price (log scale)"</span>,</span>
<span id="cb10-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Count"</span></span>
<span id="cb10-13">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb10-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-12-19_outliers/index_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>Interpretation</strong></p>
<p>This visualization reveals several important patterns:</p>
<ul>
<li><p>Each room type has its own characteristic price range.</p></li>
<li><p>The extreme right tail becomes visible without overwhelming the plot.</p></li>
<li><p>What appears as an “outlier” globally may be perfectly typical within a given room type.</p></li>
</ul>
<p>At this stage, the notion of an outlier becomes <strong>context-dependent</strong> rather than absolute.</p>
</section>
<section id="boxplots-by-room-type-highlighting-potential-extremes" class="level3" data-number="5.4">
<h3 data-number="5.4" class="anchored" data-anchor-id="boxplots-by-room-type-highlighting-potential-extremes"><span class="header-section-number">5.4</span> Boxplots by room type: highlighting potential extremes</h3>
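<p>Histograms show overall shape, but boxplots are better suited for highlighting extreme observations. We again use a log scale to preserve readability. Note that ggplot2 applies the scale transformation before computing the boxplot statistics, so the whiskers and flagged points in this plot are based on log prices rather than raw prices.</p>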
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(listings_small, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> room_type, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> price_num)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_boxplot</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">outlier.alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.35</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#DDDDDD"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_y_log10</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">label_number</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">big.mark =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">","</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb11-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Nightly Prices by Room Type (boxplot, log scale)"</span>,</span>
<span id="cb11-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Potential outliers are assessed within each room type"</span>,</span>
<span id="cb11-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Room type"</span>,</span>
<span id="cb11-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Nightly price (log scale)"</span></span>
<span id="cb11-9">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">13</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">angle =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hjust =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-12-19_outliers/index_files/figure-html/unnamed-chunk-8-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>Interpretation</strong></p>
<p>This plot makes a key point explicit:</p>
<ul>
<li><p>Outliers are flagged <strong>relative to their own room type</strong>, not the entire dataset.</p></li>
<li><p>Extremely high prices within <em>shared rooms</em> are <strong>statistically more unusual</strong> than similarly high prices within <em>entire homes</em>, given the much narrower price distribution of shared rooms.</p></li>
<li><p>Statistical outliers are candidates for further investigation, not automatic deletions.</p></li>
</ul>
</section>
<section id="what-visual-exploration-tells-us" class="level3" data-number="5.5">
<h3 data-number="5.5" class="anchored" data-anchor-id="what-visual-exploration-tells-us"><span class="header-section-number">5.5</span> What visual exploration tells us</h3>
<p>From visual inspection alone, we can conclude that:</p>
<ul>
<li><p>Airbnb price data are highly right-skewed.</p></li>
<li><p>Extreme values exist and strongly influence scale and summaries.</p></li>
<li><p>Context (here, <code>room_type</code>) is essential for meaningful interpretation.</p></li>
</ul>
<p>These observations motivate the next step: formalizing outlier detection using statistical rules such as the <strong>IQR method</strong> and <strong>Z-scores</strong>, applied <em>within room types</em> rather than globally.</p>
</section>
</section>
<section id="formal-outlier-detection-within-room-type" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="formal-outlier-detection-within-room-type"><span class="header-section-number">6</span> Formal Outlier Detection Within Room Type</h2>
<p>Visual exploration suggested that nightly prices exhibit strong right-skewness and that extreme values should be interpreted within the context of <code>room_type</code>. In this section, we formalize that intuition using statistical outlier detection rules.</p>
<p>Our goal is not to mechanically remove observations, but to <strong>identify and examine</strong> listings whose prices are unusually high relative to their own room type.</p>
<section id="the-iqr-rule" class="level3" data-number="6.1">
<h3 data-number="6.1" class="anchored" data-anchor-id="the-iqr-rule"><span class="header-section-number">6.1</span> The IQR rule</h3>
<p>The Interquartile Range (IQR) rule defines outliers based on the spread of the middle 50% of the data. For a given variable, the IQR is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BIQR%7D%20=%20Q_3%20-%20Q_1%0A"></p>
<p>An observation is flagged as a potential outlier if it lies outside the interval:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5B%20Q_1%20-%201.5%20%5Ctimes%20%5Ctext%7BIQR%7D,%20%5C;%20Q_3%20+%201.5%20%5Ctimes%20%5Ctext%7BIQR%7D%20%5D%0A"></p>
<p>Because the IQR relies on quantiles rather than the mean and standard deviation, a few extreme values barely move the fences. This robustness is an important property for heavily right-skewed price data.</p>
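<p>To make the rule concrete, here is a minimal sketch on a small vector of hypothetical prices. The numbers are invented for illustration and are not taken from the Airbnb data.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical nightly prices (illustrative only)
prices &lt;- c(40, 55, 60, 65, 70, 80, 95, 120, 900)

# Quartiles with R's default quantile definition (type 7)
q1 &lt;- quantile(prices, 0.25, names = FALSE)
q3 &lt;- quantile(prices, 0.75, names = FALSE)
iqr_value &lt;- q3 - q1

# Lower and upper fences
lower_bound &lt;- q1 - 1.5 * iqr_value
upper_bound &lt;- q3 + 1.5 * iqr_value

# Values outside [lower_bound, upper_bound] are flagged
flagged &lt;- prices[prices &lt; lower_bound | prices &gt; upper_bound]
flagged</code></pre></div>
</div>
<p>Here Q1 = 60, Q3 = 95 and IQR = 35, so the fences are 7.5 and 147.5, and only the value 900 is flagged.</p>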
</section>
<section id="applying-the-iqr-rule-within-each-room-type" class="level3" data-number="6.2">
<h3 data-number="6.2" class="anchored" data-anchor-id="applying-the-iqr-rule-within-each-room-type"><span class="header-section-number">6.2</span> Applying the IQR rule within each room type</h3>
<p>Instead of computing a single global IQR, we apply the rule <strong>separately within each room type</strong>. This ensures that prices are evaluated relative to comparable listings.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1">outliers_iqr <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> listings_small <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb12-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(room_type) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb12-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb12-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Q1 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">quantile</span>(price_num, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>),</span>
<span id="cb12-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Q3 =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">quantile</span>(price_num, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>),</span>
<span id="cb12-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">IQR_value =</span> Q3 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> Q1,</span>
<span id="cb12-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lower_bound =</span> Q1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> IQR_value,</span>
<span id="cb12-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">upper_bound =</span> Q3 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> IQR_value,</span>
<span id="cb12-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">outlier_iqr =</span> price_num <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> lower_bound <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> price_num <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> upper_bound</span>
<span id="cb12-10">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb12-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>()</span></code></pre></div></div>
</div>
<p>At this stage, each listing is labeled according to whether its price is considered an outlier <em>within its own room type</em>.</p>
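<p>The same logic can be wrapped in a small helper, which makes the group-wise idea easy to test on toy data. This is only a sketch: <code>flag_iqr()</code> and the <code>toy</code> data frame are hypothetical names introduced here, and the prices are invented.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(dplyr)

# Hypothetical helper: TRUE where x lies outside the k * IQR fences
flag_iqr &lt;- function(x, k = 1.5) {
  q &lt;- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence &lt;- k * (q[2] - q[1])
  x &lt; q[1] - fence | x &gt; q[2] + fence
}

# Invented prices: the value 200 appears in both room types
toy &lt;- tibble(
  room_type = rep(c("Shared room", "Entire home/apt"), each = 6),
  price_num = c(20, 22, 25, 27, 30, 200, 150, 180, 200, 220, 250, 260)
)

# Flag within each room type, then keep only the flagged rows
toy_flagged &lt;- toy %&gt;%
  group_by(room_type) %&gt;%
  mutate(outlier_iqr = flag_iqr(price_num)) %&gt;%
  ungroup() %&gt;%
  filter(outlier_iqr)

toy_flagged</code></pre></div>
</div>
<p>In this toy example, a price of 200 is flagged among shared rooms but not among entire homes, which is exactly the context dependence the group-wise rule is meant to capture.</p>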
</section>
<section id="how-many-outliers-do-we-detect" class="level3" data-number="6.3">
<h3 data-number="6.3" class="anchored" data-anchor-id="how-many-outliers-do-we-detect"><span class="header-section-number">6.3</span> How many outliers do we detect?</h3>
<p>Before inspecting individual listings, it is informative to summarize how many outliers are flagged in each group.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1">outliers_iqr <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb13-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(room_type, outlier_iqr) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb13-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(room_type, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">desc</span>(outlier_iqr))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 12 × 3
   room_type       outlier_iqr     n
   &lt;chr&gt;           &lt;lgl&gt;       &lt;int&gt;
 1 Entire home/apt TRUE         1458
 2 Entire home/apt FALSE       16616
 3 Entire home/apt NA           2169
 4 Hotel room      TRUE           17
 5 Hotel room      FALSE         106
 6 Hotel room      NA             34
 7 Private room    TRUE          441
 8 Private room    FALSE        6470
 9 Private room    NA           2583
10 Shared room     TRUE            3
11 Shared room     FALSE         137
12 Shared room     NA             17</code></pre>
</div>
</div>
<p><strong>Interpretation</strong></p>
<p>This table shows that outliers are not evenly distributed across room types. Some categories naturally exhibit greater price dispersion, which leads to more listings being flagged as potential outliers. The <code>NA</code> rows are listings with a missing <code>price_num</code>, for which the comparison cannot be evaluated. This reinforces the importance of <strong>group-aware detection</strong>.</p>
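<p>Because the groups differ greatly in size, raw counts are hard to compare. One option is to convert them into shares. The sketch below re-enters the counts from the table by hand and leaves out the <code>NA</code> rows; the object name <code>iqr_counts</code> is our own.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Counts copied from the table above (rows with NA flags excluded)
iqr_counts &lt;- data.frame(
  room_type = c("Entire home/apt", "Hotel room", "Private room", "Shared room"),
  flagged   = c(1458, 17, 441, 3),
  unflagged = c(16616, 106, 6470, 137)
)

# Percentage of priced listings flagged within each room type
iqr_counts$share_flagged &lt;- round(
  100 * iqr_counts$flagged / (iqr_counts$flagged + iqr_counts$unflagged),
  1
)

iqr_counts</code></pre></div>
</div>
<p>The shares come out at roughly 8.1% for entire homes, 13.8% for hotel rooms, 6.4% for private rooms, and 2.1% for shared rooms.</p>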
</section>
<section id="inspecting-extreme-cases-flagged-by-iqr" class="level3" data-number="6.4">
<h3 data-number="6.4" class="anchored" data-anchor-id="inspecting-extreme-cases-flagged-by-iqr"><span class="header-section-number">6.4</span> Inspecting extreme cases flagged by IQR</h3>
<p>Statistical flags become meaningful only when we inspect the actual observations. Below, we list the most expensive listings flagged as outliers within each room type.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1">top_price_outliers <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> outliers_iqr <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(outlier_iqr) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(room_type) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">desc</span>(price_num)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">slice_head</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb15-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(</span>
<span id="cb15-8">    room_type,</span>
<span id="cb15-9">    price,</span>
<span id="cb15-10">    price_num,</span>
<span id="cb15-11">    minimum_nights,</span>
<span id="cb15-12">    number_of_reviews,</span>
<span id="cb15-13">    neighbourhood_cleansed,</span>
<span id="cb15-14">    name</span>
<span id="cb15-15">  )</span>
<span id="cb15-16"></span>
<span id="cb15-17">top_price_outliers</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 18 × 7
   room_type       price         price_num minimum_nights number_of_reviews
   &lt;chr&gt;           &lt;chr&gt;             &lt;dbl&gt;          &lt;dbl&gt;             &lt;dbl&gt;
 1 Entire home/apt $4,437,598.00   4437598            100                14
 2 Entire home/apt $2,658,600.00   2658600            100                 3
 3 Entire home/apt $2,109,690.00   2109690            100                 0
 4 Entire home/apt $2,000,000.00   2000000            100                 0
 5 Entire home/apt $1,250,008.00   1250008            100                 0
 6 Hotel room      $2,439,497.00   2439497              1                 0
 7 Hotel room      $2,439,497.00   2439497              1                 0
 8 Hotel room      $2,439,497.00   2439497              1                 0
 9 Hotel room      $2,439,497.00   2439497              1                 0
10 Hotel room      $2,433,427.00   2433427              1                 0
11 Private room    $390,271.00      390271            365                 1
12 Private room    $390,271.00      390271            365                 0
13 Private room    $390,271.00      390271            365                 0
14 Private room    $390,271.00      390271            100                 0
15 Private room    $390,271.00      390271            100                 0
16 Shared room     $7,221.00          7221              1                 0
17 Shared room     $6,086.00          6086              1                 2
18 Shared room     $5,841.00          5841            100                 0
# ℹ 2 more variables: neighbourhood_cleansed &lt;chr&gt;, name &lt;chr&gt;</code></pre>
</div>
</div>
<p><strong>Interpretation</strong></p>
<p>At this point, the analysis moves from abstract rules to concrete questions:</p>
<ul>
<li><p>Are these listings luxury properties?</p></li>
<li><p>Do they require unusually long minimum stays?</p></li>
<li><p>Do they have very few (or no) reviews, suggesting new or inactive listings?</p></li>
<li><p>Are they located in specific neighborhoods?</p></li>
</ul>
<p>The answers to these questions determine whether a flagged observation should be:</p>
<ul>
<li><p>kept and modeled explicitly,</p></li>
<li><p>transformed (e.g., via log scaling),</p></li>
<li><p>or excluded due to data quality concerns.</p></li>
</ul>
</section>
<section id="z-scorebased-outlier-detection-concept-and-limitations" class="level3" data-number="6.5">
<h3 data-number="6.5" class="anchored" data-anchor-id="z-scorebased-outlier-detection-concept-and-limitations"><span class="header-section-number">6.5</span> Z-score–based outlier detection: concept and limitations</h3>
<p>In addition to IQR-based rules, outliers are often discussed using <strong>Z-scores</strong>. Because this method is widely taught and frequently applied, it is important to understand both how it works and when it can be misleading.</p>
<section id="what-is-a-z-score" class="level4" data-number="6.5.1">
<h4 data-number="6.5.1" class="anchored" data-anchor-id="what-is-a-z-score"><span class="header-section-number">6.5.1</span> What is a Z-score?</h4>
<p>A Z-score measures how far an observation lies from the mean, expressed in units of standard deviation. For a single observation, the Z-score is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Az%20=%20%5Cfrac%7Bx%20-%20%5Cmu%7D%7B%5Csigma%7D%0A"></p>
<p>where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cmu"> is the mean, estimated here by the sample mean</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Csigma"> is the standard deviation, estimated here by the sample standard deviation</li>
</ul>
<p>Intuitively, the Z-score answers the question:</p>
<blockquote class="blockquote">
<p>“How many standard deviations away from the mean is this observation?”</p>
</blockquote>
<p>A common heuristic labels observations with</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%7Cz%7C%20%3E%203%0A"></p>
<p>as potential outliers.</p>
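<p>As a quick illustration, the sketch below computes Z-scores by hand for a small, roughly symmetric vector of hypothetical values with one unusually high observation, and checks the result against base R's <code>scale()</code>. The data are invented.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical, roughly symmetric values plus one high observation
x &lt;- c(46, 47, 48, 49, 49, 50, 50, 50, 51, 51, 52, 53, 54, 90)

# Z-scores from the sample mean and sample standard deviation
z &lt;- (x - mean(x)) / sd(x)

# scale() performs the same centering and scaling
same_as_scale &lt;- isTRUE(all.equal(z, as.numeric(scale(x))))
same_as_scale

# Observations with |z| &gt; 3
z_flagged &lt;- x[abs(z) &gt; 3]
z_flagged</code></pre></div>
</div>
<p>In this example only the value 90 exceeds the cutoff, with a Z-score of about 3.4.</p>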
</section>
<section id="what-does-the-z-score-assume" class="level4" data-number="6.5.2">
<h4 data-number="6.5.2" class="anchored" data-anchor-id="what-does-the-z-score-assume"><span class="header-section-number">6.5.2</span> What does the Z-score assume?</h4>
<p>Z-score–based detection implicitly relies on several assumptions:</p>
<ul>
<li>the distribution is approximately symmetric,</li>
<li>the mean and standard deviation are meaningful summaries,</li>
<li>extreme values do not dominate the estimation of <img src="https://latex.codecogs.com/png.latex?%5Cmu"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma">.</li>
</ul>
<p>These assumptions are often reasonable for approximately normal data, but they are problematic for strongly skewed distributions.</p>
</section>
<section id="why-z-scores-are-problematic-for-price-data" class="level4" data-number="6.5.3">
<h4 data-number="6.5.3" class="anchored" data-anchor-id="why-z-scores-are-problematic-for-price-data"><span class="header-section-number">6.5.3</span> Why Z-scores are problematic for price data</h4>
<p>Airbnb prices are typically <strong>right-skewed</strong> with long upper tails. In such cases:</p>
<ul>
<li>extreme values inflate the mean,</li>
<li>extreme values inflate the standard deviation,</li>
<li>as a result, truly extreme observations may receive <em>moderate</em> Z-scores.</li>
</ul>
<p>This leads to a paradox: the very observations we want to detect reduce their own apparent extremeness. For this reason, Z-scores tend to <strong>under-detect</strong> outliers in heavily skewed economic data.</p>
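<p>This effect, often called masking, is easy to reproduce. The following sketch uses nine hypothetical prices, one of which is more than ten times the median. The values are invented for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical prices: eight ordinary listings and one extreme value
prices &lt;- c(40, 55, 60, 65, 70, 80, 95, 120, 900)

# Z-scores computed with the extreme value included in mean and sd
z_all &lt;- (prices - mean(prices)) / sd(prices)
z_extreme &lt;- round(z_all[prices == 900], 2)
z_extreme

# Mean and sd with and without the extreme value, for comparison
others &lt;- prices[prices != 900]
c(
  mean_all = mean(prices), sd_all = sd(prices),
  mean_others = mean(others), sd_others = sd(others)
)</code></pre></div>
</div>
<p>The extreme value pulls the mean up to 165 and the standard deviation to roughly 277, so its own Z-score is only about 2.66 and the <code>|z| &gt; 3</code> rule does not flag it.</p>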
</section>
<section id="applying-z-scores-within-each-room-type" class="level4" data-number="6.5.4">
<h4 data-number="6.5.4" class="anchored" data-anchor-id="applying-z-scores-within-each-room-type"><span class="header-section-number">6.5.4</span> Applying Z-scores within each room type</h4>
<p>Despite these limitations, Z-scores can still be informative when used carefully and comparatively. As with the IQR rule, we compute Z-scores <strong>within each room type</strong> to preserve contextual meaning.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1">outliers_z <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> listings_small <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(room_type) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb17-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean_price =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(price_num, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>),</span>
<span id="cb17-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd_price   =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(price_num, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>),</span>
<span id="cb17-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">z_price    =</span> (price_num <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> mean_price) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> sd_price,</span>
<span id="cb17-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">outlier_z  =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(z_price) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb17-8">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>()</span></code></pre></div></div>
</div>
</section>
<section id="how-many-outliers-are-flagged-by-the-z-score-rule" class="level4" data-number="6.5.5">
<h4 data-number="6.5.5" class="anchored" data-anchor-id="how-many-outliers-are-flagged-by-the-z-score-rule"><span class="header-section-number">6.5.5</span> How many outliers are flagged by the Z-score rule?</h4>
<p>After computing Z-scores within each room type, we can summarize how many listings are flagged as outliers.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1">outliers_z <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(room_type, outlier_z) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(room_type, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">desc</span>(outlier_z))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 12 × 3
   room_type       outlier_z     n
   &lt;chr&gt;           &lt;lgl&gt;     &lt;int&gt;
 1 Entire home/apt TRUE         24
 2 Entire home/apt FALSE     18050
 3 Entire home/apt NA         2169
 4 Hotel room      TRUE          5
 5 Hotel room      FALSE       118
 6 Hotel room      NA           34
 7 Private room    TRUE         26
 8 Private room    FALSE      6885
 9 Private room    NA         2583
10 Shared room     TRUE          1
11 Shared room     FALSE       139
12 Shared room     NA           17</code></pre>
</div>
</div>
<p><strong>Interpretation</strong></p>
<p>Set against the IQR results above, this table reveals a striking pattern:</p>
<ul>
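<li><p>The number of Z-score–based outliers is <strong>much smaller</strong> than the number detected by the IQR rule (for example, 24 versus 1,458 entire homes and 26 versus 441 private rooms).</p></li>
<li><p>In the smallest groups, only a handful of listings are flagged: five hotel rooms and a single shared room.</p></li>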
</ul>
<p>This is a direct consequence of right-skewness: extreme prices inflate both the mean and the standard deviation, making Z-scores appear less extreme than expected.</p>
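<p>The same contrast can be reproduced on simulated data. The sketch below draws right-skewed values from a log-normal distribution (an assumption chosen for illustration, not a model fitted to the Airbnb prices) and cross-tabulates the two flags.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated right-skewed "prices" (log-normal), not the Airbnb data
sim_price &lt;- rlnorm(1000, meanlog = 4.5, sdlog = 1)

# IQR rule
q &lt;- quantile(sim_price, c(0.25, 0.75), names = FALSE)
fence &lt;- 1.5 * (q[2] - q[1])
iqr_flag &lt;- sim_price &lt; q[1] - fence | sim_price &gt; q[2] + fence

# Z-score rule
z_flag &lt;- abs((sim_price - mean(sim_price)) / sd(sim_price)) &gt; 3

# Cross-tabulate the two flags
table(iqr = iqr_flag, z = z_flag)</code></pre></div>
</div>
<p>On data of this shape, the IQR rule typically flags several times as many observations as the Z-score rule, mirroring what we see in the room-type tables.</p>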
</section>
<section id="inspecting-listings-flagged-by-z-scores" class="level4" data-number="6.5.6">
<h4 data-number="6.5.6" class="anchored" data-anchor-id="inspecting-listings-flagged-by-z-scores"><span class="header-section-number">6.5.6</span> Inspecting listings flagged by Z-scores</h4>
<p>To understand what Z-scores actually flag as outliers, we inspect the most extreme listings according to their Z-score values.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb20-1">top_z_outliers <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> outliers_z <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb20-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(outlier_z) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb20-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(room_type) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb20-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">desc</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(z_price))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb20-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">slice_head</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb20-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb20-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(</span>
<span id="cb20-8">    room_type,</span>
<span id="cb20-9">    price,</span>
<span id="cb20-10">    price_num,</span>
<span id="cb20-11">    z_price,</span>
<span id="cb20-12">    minimum_nights,</span>
<span id="cb20-13">    number_of_reviews,</span>
<span id="cb20-14">    neighbourhood_cleansed,</span>
<span id="cb20-15">    name</span>
<span id="cb20-16">  )</span>
<span id="cb20-17"></span>
<span id="cb20-18">top_z_outliers</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 16 × 8
   room_type       price      price_num z_price minimum_nights number_of_reviews
   &lt;chr&gt;           &lt;chr&gt;          &lt;dbl&gt;   &lt;dbl&gt;          &lt;dbl&gt;             &lt;dbl&gt;
 1 Entire home/apt $4,437,59…   4437598   92.9             100                14
 2 Entire home/apt $2,658,60…   2658600   55.6             100                 3
 3 Entire home/apt $2,109,69…   2109690   44.1             100                 0
 4 Entire home/apt $2,000,00…   2000000   41.8             100                 0
 5 Entire home/apt $1,250,00…   1250008   26.1             100                 0
 6 Hotel room      $2,439,49…   2439497    4.84              1                 0
 7 Hotel room      $2,439,49…   2439497    4.84              1                 0
 8 Hotel room      $2,439,49…   2439497    4.84              1                 0
 9 Hotel room      $2,439,49…   2439497    4.84              1                 0
10 Hotel room      $2,433,42…   2433427    4.83              1                 0
11 Private room    $390,271.…    390271   31.3             365                 1
12 Private room    $390,271.…    390271   31.3             365                 0
13 Private room    $390,271.…    390271   31.3             365                 0
14 Private room    $390,271.…    390271   31.3             100                 0
15 Private room    $390,271.…    390271   31.3             100                 0
16 Shared room     $7,221.00       7221    3.65              1                 0
# ℹ 2 more variables: neighbourhood_cleansed &lt;chr&gt;, name &lt;chr&gt;</code></pre>
</div>
</div>
<p><strong>Interpretation</strong></p>
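<p>Compared with the full set of IQR-based outliers, the listings flagged here are:</p>
<ul>
<li><p>the most extreme in absolute price within each room type, with several <code>price_num</code> values above one million,</p></li>
<li><p>few in number (56 across all room types),</p></li>
<li><p>often repeated at identical prices (for example, four hotel rooms share the same value), which may point to duplicated or placeholder entries rather than genuine market prices.</p></li>
</ul>
<p>Because only such extreme values cross the threshold, Z-score–based detection tends to <strong>miss many high but valid prices</strong> in heavily skewed data.</p>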
</section>
<section id="comparing-iqr-and-z-score-results" class="level4" data-number="6.5.7">
<h4 data-number="6.5.7" class="anchored" data-anchor-id="comparing-iqr-and-z-score-results"><span class="header-section-number">6.5.7</span> Comparing IQR and Z-score results</h4>
<p>Finally, we compare how many listings are flagged by each method.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb22-1">comparison_summary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> outliers_iqr <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb22-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(id, room_type, outlier_iqr) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb22-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(</span>
<span id="cb22-4">    outliers_z <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(id, outlier_z),</span>
<span id="cb22-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span></span>
<span id="cb22-6">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb22-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">count</span>(outlier_iqr, outlier_z)</span>
<span id="cb22-8"></span>
<span id="cb22-9">comparison_summary</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 4 × 3
  outlier_iqr outlier_z     n
  &lt;lgl&gt;       &lt;lgl&gt;     &lt;int&gt;
1 FALSE       FALSE     23329
2 TRUE        FALSE      1863
3 TRUE        TRUE         56
4 NA          NA         4803</code></pre>
</div>
</div>
<p><strong>Interpretation</strong></p>
<p>This comparison highlights a key methodological insight:</p>
<ul>
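<li><p>Many listings flagged by the IQR rule are <strong>not flagged</strong> by Z-scores: 1,863 listings are IQR outliers only.</p></li>
<li><p>Every listing flagged by Z-scores (56 in total) is also flagged by the IQR rule; no listing is flagged by Z-scores alone.</p></li>
<li><p>The overlap is therefore asymmetric: the Z-score flags form a subset of the IQR flags.</p></li>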
</ul>
<p>In other words, the Z-score rule is more conservative and may under-detect outliers when distributions are strongly skewed. This does not make Z-scores “wrong”, but it does limit their usefulness as a primary detection method for price data.</p>
<p>In practice, Z-score–based flags often differ substantially from IQR-based flags. This difference is not an error—it reflects different assumptions.</p>
<ul>
<li><p>IQR-based methods rely on ranks and quantiles</p></li>
<li><p>Z-score–based methods rely on moments (mean and variance)</p></li>
</ul>
<p>For heavily skewed price data, IQR-based detection is usually more reliable, while Z-scores should be interpreted as a <strong>supplementary diagnostic</strong> rather than a primary rule.</p>
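<p>The difference between rank-based and moment-based summaries can be illustrated with a small sketch. The vectors <code>x</code> and <code>y</code> below are hypothetical; they share the same ranks and differ only in how large the maximum is.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Two hypothetical samples that differ only in their largest value
x &lt;- c(1:99, 1000)
y &lt;- c(1:99, 1e6)

# Quartiles depend on ranks, so they are identical for both samples
quantile(x, c(0.25, 0.75))
quantile(y, c(0.25, 0.75))

# Mean and standard deviation depend on magnitudes, so they differ widely
c(mean_x = mean(x), mean_y = mean(y))
c(sd_x = sd(x), sd_y = sd(y))</code></pre></div>
</div>
<p>Because the IQR fences are built from the quartiles, they stay the same however extreme the maximum becomes, whereas every Z-score in the sample changes with it.</p>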
</section>
</section>
</section>
<section id="should-outliers-be-removed" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="should-outliers-be-removed"><span class="header-section-number">7</span> Should Outliers Be Removed?</h2>
<p>Detecting outliers does <strong>not</strong> imply that they should be automatically removed. Outlier detection is a diagnostic step, not a cleaning instruction.</p>
<p>In the context of Airbnb price data, many extreme values correspond to luxury properties, large homes, or special accommodation types. Blindly removing such observations may erase precisely the information that makes the data interesting.</p>
<p>Instead, several alternative strategies should be considered.</p>
<section id="verify-and-understand-the-source-of-extremeness" class="level3" data-number="7.1">
<h3 data-number="7.1" class="anchored" data-anchor-id="verify-and-understand-the-source-of-extremeness"><span class="header-section-number">7.1</span> Verify and understand the source of extremeness</h3>
<p>The first question should always be <em>why</em> an observation is extreme.</p>
<ul>
<li>Is the listing a luxury property?</li>
<li>Does it belong to a specific <code>room_type</code>?</li>
<li>Is it located in a high-demand neighborhood?</li>
<li>Is it associated with unusual booking constraints (e.g., very high <code>minimum_nights</code>)?</li>
</ul>
<p>In many cases, extreme values are <strong>valid reflections of heterogeneity</strong>, not data errors.</p>
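<p>One way to start answering these questions is to tabulate the flagged listings by their context. The sketch below uses a small hypothetical table, <code>flagged</code>, whose column names mirror those used earlier; the thresholds for <code>long_stay_only</code> and <code>never_reviewed</code> are illustrative assumptions, not recommendations.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical flagged listings (values invented for illustration)
flagged &lt;- tibble(
  room_type         = c("Entire home/apt", "Entire home/apt", "Private room", "Hotel room"),
  price_num         = c(4437598, 25000, 390271, 2439497),
  minimum_nights    = c(100, 2, 365, 1),
  number_of_reviews = c(14, 120, 0, 0)
)

# Add simple context indicators, then count listings per combination
flagged %&gt;%
  mutate(
    long_stay_only = minimum_nights &gt;= 100,   # assumed cut-off
    never_reviewed = number_of_reviews == 0
  ) %&gt;%
  count(room_type, long_stay_only, never_reviewed)</code></pre></div>
</div>
<p>A flagged listing with many reviews and ordinary booking rules is more plausibly a genuine high-end property than one that has never been reviewed and can only be booked for months at a time.</p>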
</section>
<section id="use-transformations-or-robust-methods" class="level3" data-number="7.2">
<h3 data-number="7.2" class="anchored" data-anchor-id="use-transformations-or-robust-methods"><span class="header-section-number">7.2</span> Use transformations or robust methods</h3>
<p>When extreme values distort summaries or model estimates, removal is not the only option.</p>
<p>Common alternatives include:</p>
<ul>
<li>transforming the response variable (e.g., log transformation of prices),</li>
<li>using robust estimators that reduce sensitivity to extremes,</li>
<li>modeling medians or quantiles instead of means.</li>
</ul>
<p>These approaches preserve information while reducing the influence of extreme observations.</p>
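<p>A brief sketch of these alternatives in base R, assuming a small hypothetical price vector (the values are invented for illustration):</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical prices with one extreme value
prices &lt;- c(450, 600, 750, 900, 1200, 1500, 2400, 4000, 390271)

# Mean and SD are dominated by the single extreme price
c(mean = mean(prices), sd = sd(prices))

# Robust counterparts: median and median absolute deviation
c(median = median(prices), mad = mad(prices))

# Log transformation compresses the right tail before standardizing
log_prices &lt;- log(prices)
z_log &lt;- (log_prices - mean(log_prices)) / sd(log_prices)
round(z_log, 2)

# Quantiles summarize the distribution without relying on the mean
quantile(prices, c(0.5, 0.9))</code></pre></div>
</div>
<p>None of these steps deletes the extreme listing; each one only limits how strongly it drives the summary.</p>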
</section>
<section id="model-extremes-explicitly-when-relevant" class="level3" data-number="7.3">
<h3 data-number="7.3" class="anchored" data-anchor-id="model-extremes-explicitly-when-relevant"><span class="header-section-number">7.3</span> Model extremes explicitly when relevant</h3>
<p>In some applications, outliers are not nuisances but the primary object of interest.</p>
<p>Examples include:</p>
<ul>
<li>luxury market analysis,</li>
<li>risk assessment,</li>
<li>rare but high-impact events.</li>
</ul>
<p>In such cases, extreme observations should be modeled explicitly rather than suppressed.</p>
</section>
<section id="a-final-perspective" class="level3" data-number="7.4">
<h3 data-number="7.4" class="anchored" data-anchor-id="a-final-perspective"><span class="header-section-number">7.4</span> A final perspective</h3>
<p>Outliers are not merely statistical inconveniences. They often highlight structural differences, market segmentation, or meaningful departures from typical behavior.</p>
<p>Understanding <em>why</em> an observation is extreme is frequently more informative than deleting it.</p>
<p>In practice, thoughtful outlier handling requires a balance between statistical rules, domain knowledge, and modeling objectives.</p>
</section>
</section>
<section id="final-remarks" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="final-remarks"><span class="header-section-number">8</span> Final Remarks</h2>
<p>Outlier detection is a natural step in the data preprocessing workflow. It typically follows missing data analysis and precedes scaling, transformation, or model fitting.</p>
<p>In this article, the focus was not on eliminating extreme values, but on <strong>understanding why they occur</strong>. Through visual exploration, context-aware analysis using <code>room_type</code>, and formal detection rules such as the IQR method and Z-scores, we demonstrated that extreme values are not noise by default.</p>
<p>In many real-world datasets, especially those involving prices or economic behavior, extreme values reflect structural heterogeneity rather than data quality issues. Treating them blindly as errors risks discarding meaningful information.</p>
<p>Outliers should therefore be approached as <strong>questions posed by the data</strong>: Why is this observation extreme? Does it represent a different regime, a rare event, or a distinct subgroup?</p>
<p>Answering these questions requires a combination of statistical tools, domain knowledge, and clear analytical goals. When handled thoughtfully, outlier analysis enhances both the robustness and the interpretability of downstream models.</p>
</section>
<section id="references-and-further-reading" class="level2" data-number="9">
<h2 data-number="9" class="anchored" data-anchor-id="references-and-further-reading"><span class="header-section-number">9</span> References and Further Reading</h2>
<ul>
<li><p>Tukey, J. W. (1977). <em>Exploratory Data Analysis</em>. Addison-Wesley.<br>
(Foundational reference for boxplots, IQR, and exploratory thinking.)</p></li>
<li><p>Hastie, T., Tibshirani, R., Friedman, J. (2009). <em>The Elements of Statistical Learning</em>. Springer.<br>
(Statistical foundations and the role of robust methods in modeling.)</p></li>
<li><p>James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). <em>An Introduction to Statistical Learning</em>. Springer.<br>
(Accessible discussion of preprocessing, transformations, and practical modeling considerations.)</p></li>
<li><p>NIST/SEMATECH e-Handbook of Statistical Methods – Outliers<br>
<a href="https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm" class="uri">https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm</a><br>
(Authoritative overview of outlier concepts and detection methods.)</p></li>
<li><p>Wickham, H., Grolemund, G. (2017). <em>R for Data Science</em>. O’Reilly Media.<br>
(Practical guidance on data exploration, visualization, and preprocessing workflows in R.)</p></li>
<li><p>Inside Airbnb – Get the Data<br>
<a href="https://insideairbnb.com/get-the-data/" class="uri">https://insideairbnb.com/get-the-data/</a><br>
(Data source used in this article; city-level Airbnb listings data.)</p></li>
<li><p>Wickham, H. (2016). <em>ggplot2: Elegant Graphics for Data Analysis</em>. Springer.<br>
(Principles of layered graphics and effective visualization used throughout the article.)</p></li>
</ul>



</section>

 ]]></description>
  <category>R</category>
  <category>Statistics</category>
  <category>Data Analysis</category>
  <category>Data Science</category>
  <category>Data Preprocessing</category>
  <category>Outliers</category>
  <category>Inter Quartile Range</category>
  <category>Z-Score</category>
  <category>Data Cleaning</category>
  <guid>https://mfatihtuzen.github.io/posts/2025-12-19_outliers/</guid>
  <pubDate>Fri, 19 Dec 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Handling Missing Data in R: A Comprehensive Guide</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2025-08-18_missing_values/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-08-18_missing_values/missing_values.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:85.0%"></p>
</figure>
</div>
<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>Data preprocessing is a cornerstone of any data analysis or machine learning pipeline. Raw data rarely comes in a form ready for direct analysis — it often requires cleaning, transformation, normalization, and careful handling of anomalies. Among these preprocessing tasks, dealing with missing data stands out as one of the most critical and unavoidable challenges.</p>
<p>Missing values appear in virtually every domain: surveys may have skipped questions, administrative registers might contain incomplete records, and clinical trials can lose patients who drop out. Ignoring these gaps or handling them naively does not just reduce the amount of usable information; it can also introduce bias, decrease statistical power, and ultimately compromise the validity of conclusions. In other words, missing data is not just an inconvenience; it is a methodological problem that demands rigorous attention.</p>
<p>In statistical practice, missingness is often represented as <code>NA</code> (Not Available) in R. However, not all missing values are created equal. Some are missing completely at random, others depend on observed variables, and in some cases, the missingness itself carries meaningful information. Understanding these mechanisms is essential before deciding how to address them. This makes missing data imputation a fundamental part of the broader data preprocessing workflow, alongside tasks such as outlier detection, data normalization, and feature engineering.</p>
<p>In this article, we will cover:</p>
<ul>
<li>The theoretical foundations of missing data mechanisms (MCAR, MAR, MNAR).</li>
<li>How to detect and visualize missing values in R.</li>
<li>Different strategies for handling missingness, from simple imputation to advanced multiple imputation techniques.</li>
<li>A practical workflow using the NHANES dataset, widely used in health research, to demonstrate methods in R.</li>
<li>Best practices, pitfalls, and recommendations for applied data science.</li>
</ul>
<p>We will use several R packages throughout this tutorial:</p>
<ul>
<li><strong>tidyverse</strong>: Data wrangling and visualization</li>
<li><strong>naniar</strong> and <strong>VIM</strong>: Tools for exploring and visualizing missing data</li>
<li><strong>mice</strong>: Multiple imputation by chained equations</li>
<li><strong>missForest</strong>: Random forest–based imputation for nonlinear data</li>
</ul>
<p>By integrating missing data handling into the larger context of preprocessing, this structured approach will not only help you manage incomplete datasets effectively but also ensure that your entire analytical workflow remains <strong>robust, transparent, and reliable</strong>.</p>
</section>
<section id="nhanes-dataset" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="nhanes-dataset"><span class="header-section-number">2</span> NHANES Dataset</h2>
<p>In this section, we will work with the <strong>NHANES</strong> dataset, which comes from the US National Health and Nutrition Examination Survey.<br>
The dataset includes demographic, examination, and laboratory data collected from thousands of individuals.<br>
Since the full dataset is quite large, we will focus only on a subset of variables that are relevant for preprocessing examples.</p>
<p>Here are the variables we will use:</p>
<ul>
<li><strong>ID</strong>: Participant identifier (the packaged data are resampled, so the same ID can appear in more than one row, as the output below shows)</li>
<li><strong>Age</strong>: Age of the participant</li>
<li><strong>Gender</strong>: Biological sex (male or female)</li>
<li><strong>BMI</strong>: Body Mass Index</li>
<li><strong>BPSysAve</strong>: Average systolic blood pressure</li>
<li><strong>Diabetes</strong>: Whether the participant has been diagnosed with diabetes</li>
</ul>
<p>Before diving into preprocessing, let’s take a quick look at the structure of these selected variables:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(NHANES)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"NHANES"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Select relevant variables</span></span>
<span id="cb1-7">nhanes_sub <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> NHANES <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb1-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(ID, Age, Gender, BMI, BPSysAve, Diabetes)</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glimpse</span>(nhanes_sub)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Rows: 10,000
Columns: 6
$ ID       &lt;int&gt; 51624, 51624, 51624, 51625, 51630, 51638, 51646, 51647, 51647…
$ Age      &lt;int&gt; 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10, 58, 50, …
$ Gender   &lt;fct&gt; male, male, male, male, female, male, male, female, female, f…
$ BMI      &lt;dbl&gt; 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.24, 27.24…
$ BPSysAve &lt;int&gt; 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111, 104, 134…
$ Diabetes &lt;fct&gt; No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, N…</code></pre>
</div>
</div>
</section>
<section id="why-missingness-matters" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="why-missingness-matters"><span class="header-section-number">3</span> Why Missingness Matters</h2>
<p>Missing data is not just an inconvenience — it can distort the statistical conclusions we draw from a dataset.<br>
There are several critical reasons why handling missingness properly is essential:</p>
<ul>
<li><strong>Biased results</strong>: If the missing values are not random, analyses may systematically misrepresent the population.</li>
<li><strong>Reduced sample size</strong>: Complete-case analysis (simply dropping missing rows) reduces data availability, weakening statistical power.</li>
<li><strong>Model incompatibility</strong>: Many modeling techniques in R (e.g., <code>lm()</code>, <code>glm()</code>) require complete data, and will automatically drop cases with missing values, sometimes silently.</li>
</ul>
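<p>The silent dropping mentioned in the last point can be checked directly. The sketch below uses the built-in <code>airquality</code> data rather than NHANES, purely as a stand-in that needs no extra packages.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># airquality ships with R and has missing values in Ozone and Solar.R
fit &lt;- lm(Ozone ~ Solar.R + Wind, data = airquality)

nrow(airquality)       # rows supplied to lm()
nobs(fit)              # rows actually used after listwise deletion
length(fit$na.action)  # rows dropped because of missing values

# na.fail turns the silent deletion into an explicit error
strict &lt;- try(
  lm(Ozone ~ Solar.R + Wind, data = airquality, na.action = na.fail),
  silent = TRUE
)
inherits(strict, "try-error")</code></pre></div>
</div>
<p>Comparing <code>nobs()</code> with <code>nrow()</code> after every fit is a cheap habit that makes listwise deletion visible.</p>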
<section id="a-short-case-example-bmi-missingness-and-blood-pressure" class="level3" data-number="3.1">
<h3 data-number="3.1" class="anchored" data-anchor-id="a-short-case-example-bmi-missingness-and-blood-pressure"><span class="header-section-number">3.1</span> A Short Case Example: BMI Missingness and Blood Pressure</h3>
<p>Suppose we want to explore how <strong>Body Mass Index (BMI)</strong> relates to <strong>Systolic Blood Pressure (BPSysAve)</strong>.<br>
However, BMI contains missing values. If we ignore them and only analyze complete cases, we may end up with biased conclusions.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># How many missing in BMI?</span></span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(nhanes_sub<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>BMI))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 366</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Complete-case dataset (dropping missing BMI)</span></span>
<span id="cb5-2">nhanes_complete <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb5-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(BMI))</span>
<span id="cb5-4"></span>
<span id="cb5-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compare sample sizes</span></span>
<span id="cb5-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(nhanes_sub)     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># original sample size</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 10000</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(nhanes_complete) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># after dropping missing BMI</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 9634</code></pre>
</div>
</div>
<p>Dropping rows with missing BMI removes 366 of the 10,000 observations (about 3.7%). The loss is modest here, but it still reduces efficiency and can <strong>bias the estimates</strong> if those missing values are not randomly distributed.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit regression with complete cases only</span></span>
<span id="cb9-2">model_complete <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(BPSysAve <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> BMI <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Gender, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> nhanes_complete)</span>
<span id="cb9-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(model_complete)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = BPSysAve ~ BMI + Age + Gender, data = nhanes_complete)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.281  -8.652  -0.955   7.560 102.790 

Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 90.016503   0.677618  132.84   &lt;2e-16 ***
BMI          0.328076   0.023228   14.12   &lt;2e-16 ***
Age          0.412758   0.008076   51.11   &lt;2e-16 ***
Gendermale   4.346847   0.313476   13.87   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.43 on 8483 degrees of freedom
  (1147 observations deleted due to missingness)
Multiple R-squared:  0.2969,    Adjusted R-squared:  0.2966 
F-statistic:  1194 on 3 and 8483 DF,  p-value: &lt; 2.2e-16</code></pre>
</div>
</div>
<p><strong>Interpretation</strong>:</p>
<ul>
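<li>The model uses complete cases only, ignoring potentially informative missingness. Although rows with missing BMI were already removed, <code>lm()</code> dropped a further 1,147 rows with missing values in other model variables such as <code>BPSysAve</code>, leaving 8,487 observations.</li>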
<li>If BMI is more often missing in certain subgroups (e.g., older adults or females), then the relationship estimated here does not represent the whole population.</li>
<li>In later sections, we will see how different imputation strategies can mitigate this problem.</li>
</ul>
</section>
</section>
<section id="missing-data-mechanisms" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="missing-data-mechanisms"><span class="header-section-number">4</span> Missing Data Mechanisms</h2>
<p>One of the most crucial aspects of handling missing data is to understand <strong>why</strong> the data are missing.<br>
The mechanism behind missingness determines whether our chosen method will yield unbiased and efficient estimates.</p>
<section id="types-of-missing-data-mechanisms" class="level3" data-number="4.1">
<h3 data-number="4.1" class="anchored" data-anchor-id="types-of-missing-data-mechanisms"><span class="header-section-number">4.1</span> Types of Missing Data Mechanisms</h3>
<ul>
<li><p><strong>MCAR (Missing Completely At Random)</strong><br>
The probability of a value being missing does not depend on either the observed or the unobserved data.<br>
→ Example: A lab machine randomly fails for some patients, regardless of their characteristics.<br>
→ Implication: Complete-case analysis is valid (though less efficient).</p></li>
<li><p><strong>MAR (Missing At Random)</strong><br>
The probability of missingness depends only on the <strong>observed</strong> data, not on the missing values themselves.<br>
→ Example: People with lower income are less likely to report their weight, but we observe income.<br>
→ Implication: Multiple imputation or likelihood-based methods can recover unbiased estimates.</p></li>
<li><p><strong>MNAR (Missing Not At Random)</strong><br>
The probability of missingness depends on the <strong>unobserved</strong> value itself.<br>
→ Example: People with higher BMI are less likely to report their weight.<br>
→ Implication: Strong assumptions or external information are needed; imputation under MAR will still be biased.</p></li>
</ul>
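<p>The three mechanisms can be made concrete with a small simulation. This is a sketch under stated assumptions: the variables <code>age</code> and <code>bmi</code> are simulated, and the coefficients that control missingness are arbitrary values chosen only to make the effect visible.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)
n   &lt;- 5000
age &lt;- round(runif(n, 20, 80))
bmi &lt;- 22 + 0.08 * age + rnorm(n, sd = 4)  # simulated BMI that rises with age

# MCAR: every value has the same 20% chance of being missing
miss_mcar &lt;- rbinom(n, 1, 0.2) == 1

# MAR: missingness depends only on the observed variable age
miss_mar &lt;- rbinom(n, 1, plogis(-4 + 0.05 * age)) == 1

# MNAR: missingness depends on the BMI value itself
miss_mnar &lt;- rbinom(n, 1, plogis(-8 + 0.25 * bmi)) == 1

# Complete-case means under each mechanism versus the full-data mean
c(
  full = mean(bmi),
  mcar = mean(bmi[!miss_mcar]),
  mar  = mean(bmi[!miss_mar]),
  mnar = mean(bmi[!miss_mnar])
)</code></pre></div>
</div>
<p>Under MCAR the complete-case mean should stay close to the full-data mean. Under MAR and MNAR it tends to drift downward in this setup, because higher BMI values are more likely to be missing. The difference is that the MAR bias can be corrected by conditioning on <code>age</code>, while the MNAR bias cannot be corrected from the observed data alone.</p>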
</section>
<section id="what-each-mechanism-implies-with-nhanes-intuition" class="level3" data-number="4.2">
<h3 data-number="4.2" class="anchored" data-anchor-id="what-each-mechanism-implies-with-nhanes-intuition"><span class="header-section-number">4.2</span> What each mechanism implies (with NHANES intuition)</h3>
<ul>
<li><p><strong>MCAR</strong> — e.g., random device failure that occasionally prevents recording <code>BMI</code>.<br>
<em>Implication:</em> Complete-case analysis (dropping rows) is unbiased but wastes data.</p></li>
<li><p><strong>MAR</strong> — e.g., <code>BMI</code> missingness varies by observed <strong>Age</strong> or <strong>Gender</strong>.<br>
<em>Implication:</em> Likelihood-based methods or <strong>Multiple Imputation (MI)</strong> are valid if those predictors are in the imputation model.</p></li>
<li><p><strong>MNAR</strong> — e.g., people with <strong>very high BMI</strong> systematically do not report it.<br>
<em>Implication:</em> MAR-based methods remain biased; this case requires <strong>sensitivity analysis</strong> or explicit MNAR models.</p></li>
</ul>
</section>
<section id="quick-nhanes-checks-that-suggest-a-mechanism" class="level3" data-number="4.3">
<h3 data-number="4.3" class="anchored" data-anchor-id="quick-nhanes-checks-that-suggest-a-mechanism"><span class="header-section-number">4.3</span> Quick NHANES checks that suggest a mechanism</h3>
<p>Below we run two simple diagnostics on our working subset <code>nhanes_sub</code><br>
(defined earlier as: <code>NHANES |&gt; select(ID, Age, Gender, BMI, BPSysAve, Diabetes)</code>).</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Packages we already use</span></span>
<span id="cb11-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb11-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(knitr)</span>
<span id="cb11-4"></span>
<span id="cb11-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 1) Overall BMI missingness</span></span>
<span id="cb11-6">nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb11-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pct_missing_BMI =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(BMI)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb11-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pct_missing_BMI =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(pct_missing_BMI, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb11-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kable</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">caption =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Overall BMI missingness (%)"</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<table class="caption-top table table-sm table-striped small">
<caption>Overall BMI missingness (%)</caption>
<thead>
<tr class="header">
<th style="text-align: right;">pct_missing_BMI</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;">3.7</td>
</tr>
</tbody>
</table>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 2) Does BMI missingness vary by observed variables? (MAR hint)</span></span>
<span id="cb12-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#    - By Gender</span></span>
<span id="cb12-3">by_gender <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(Gender) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pct_miss_BMI =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(BMI)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,</span>
<span id="cb12-6">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.groups =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"drop"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pct_miss_BMI =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(pct_miss_BMI, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb12-8"></span>
<span id="cb12-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#    - By Age groups (bins)</span></span>
<span id="cb12-10">by_age <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">AgeBand =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cut</span>(Age, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">breaks =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">45</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">Inf</span>),</span>
<span id="cb12-12">                       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"&lt;30"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"30–44"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"45–59"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"60+"</span>),</span>
<span id="cb12-13">                       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">right =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(AgeBand) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pct_miss_BMI =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(BMI)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,</span>
<span id="cb12-16">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.groups =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"drop"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pct_miss_BMI =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(pct_miss_BMI, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb12-18"></span>
<span id="cb12-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show summaries nicely</span></span>
<span id="cb12-20"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kable</span>(by_gender, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">caption =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BMI missingness by Gender (%)"</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<table class="caption-top table table-sm table-striped small">
<caption>BMI missingness by Gender (%)</caption>
<thead>
<tr class="header">
<th style="text-align: left;">Gender</th>
<th style="text-align: right;">pct_miss_BMI</th>
<th style="text-align: right;">n</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">female</td>
<td style="text-align: right;">3.6</td>
<td style="text-align: right;">5020</td>
</tr>
<tr class="even">
<td style="text-align: left;">male</td>
<td style="text-align: right;">3.8</td>
<td style="text-align: right;">4980</td>
</tr>
</tbody>
</table>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kable</span>(by_age,    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">caption =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BMI missingness by Age band (%)"</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<table class="caption-top table table-sm table-striped small">
<caption>BMI missingness by Age band (%)</caption>
<thead>
<tr class="header">
<th style="text-align: left;">AgeBand</th>
<th style="text-align: right;">pct_miss_BMI</th>
<th style="text-align: right;">n</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">&lt;30</td>
<td style="text-align: right;">7.6</td>
<td style="text-align: right;">4121</td>
</tr>
<tr class="even">
<td style="text-align: left;">30–44</td>
<td style="text-align: right;">0.5</td>
<td style="text-align: right;">2049</td>
</tr>
<tr class="odd">
<td style="text-align: left;">45–59</td>
<td style="text-align: right;">0.6</td>
<td style="text-align: right;">1991</td>
</tr>
<tr class="even">
<td style="text-align: left;">60+</td>
<td style="text-align: right;">1.7</td>
<td style="text-align: right;">1839</td>
</tr>
</tbody>
</table>
</div>
</div>
<p><strong>Interpretation:</strong></p>
<ul>
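<li><p>If <code>pct_miss_BMI</code> is <strong>similar across groups</strong>, MCAR is more plausible.</p></li>
<li><p>If missingness <strong>changes with Age or Gender</strong>, MCAR is implausible and <strong>MAR</strong> (given those variables) becomes a reasonable working assumption; we must then include those predictors in imputation.</p></li>
<li><p>In the tables above, missingness is nearly identical by Gender (3.6% vs.&nbsp;3.8%) but much higher in the youngest age band (7.6%) than in the others (0.5–1.7%), so Age belongs in any imputation model for <code>BMI</code>.</p></li>
<li><p>These are <em>indicators</em>, not proofs; <strong>MNAR</strong> cannot be ruled out from the observed data alone and needs external information or sensitivity analyses.</p></li>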
</ul>
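<p>The age-band comparison can also be put through a quick formal check. The sketch below is only an approximation under stated assumptions: it back-calculates the missing counts from the rounded percentages and group sizes printed in the age-band table above (it does not use the raw data), ignores the survey design, and uses generic column names (<code>band1</code> to <code>band4</code>, youngest to oldest) that are hypothetical labels for this example.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Group sizes and rounded missingness percentages, copied from the table above
n_band   &lt;- c(4121, 2049, 1991, 1839)
pct_miss &lt;- c(7.6, 0.5, 0.6, 1.7)

# Approximate missing counts (rounded percentages, so not exact)
n_miss &lt;- round(n_band * pct_miss / 100)
sum(n_miss)   # sanity check against the overall number of missing BMI values

# 2 x 4 table: missing vs. observed BMI by age band
tab &lt;- rbind(missing = n_miss, observed = n_band - n_miss)
colnames(tab) &lt;- paste0("band", 1:4)
tab

# Chi-squared test of independence between age band and BMI missingness
chisq.test(tab)</code></pre></div>
</div>
<p>A small p-value here argues against MCAR, but it cannot tell MAR from MNAR; that distinction still rests on subject-matter reasoning.</p>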
<p><strong>Which methods are valid under which mechanism?</strong></p>
<table class="caption-top table">
<colgroup>
<col style="width: 6%">
<col style="width: 32%">
<col style="width: 35%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th>Mechanism</th>
<th>Example (NHANES context)</th>
<th>Valid methods</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>MCAR</strong></td>
<td>Random loss of <code>BMI</code> records</td>
<td>Complete-case, single imputation, MI</td>
<td>Complete-case is unbiased but wastes data; single imputation still understates uncertainty</td>
</tr>
<tr class="even">
<td><strong>MAR</strong></td>
<td><code>BMI</code> missingness varies by observed <code>Age</code>, <code>Gender</code></td>
<td><strong>Multiple Imputation (MICE)</strong>, likelihood/EM, missForest</td>
<td>Include strong predictors of missingness</td>
</tr>
<tr class="odd">
<td><strong>MNAR</strong></td>
<td>People with very high <code>BMI</code> hide it</td>
<td>Sensitivity analysis, selection/pattern-mixture models</td>
<td>MAR-based MI alone is biased</td>
</tr>
</tbody>
</table>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Optional: Little’s MCAR Test
</div>
</div>
<div class="callout-body-container callout-body">
<p>Little’s MCAR test is a statistical procedure used to examine whether data are <strong>Missing Completely at Random (MCAR)</strong>.</p>
<p>⚠️ However, this test comes with important caveats:</p>
<ul>
<li>It can be overly sensitive in <strong>large samples</strong>, flagging trivial deviations.</li>
<li>In <strong>small samples</strong>, its power is often too low to detect meaningful departures from MCAR.</li>
</ul>
<p>Because of these limitations, it should be treated only as a <strong>supporting tool</strong> rather than a definitive test when diagnosing missingness mechanisms.</p>
</div>
</div>
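<p>For readers who want to try the test, here is a minimal sketch. It assumes that the installed version of <code>naniar</code> exports <code>mcar_test()</code>, and it uses the built-in <code>airquality</code> data only as a stand-in so the example runs on its own; in this post you would pass <code>nhanes_sub</code> instead.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(naniar)

# airquality ships with R and has NAs in Ozone and Solar.R;
# it is a stand-in here, not the data analysed in this post
little &lt;- mcar_test(airquality)

little           # one-row summary of the test
little$p.value   # small values count as evidence against MCAR</code></pre></div>
</div>
<p>Given the caveats above, the p-value is best read alongside the group-wise missingness tables rather than on its own.</p>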
</section>
</section>
<section id="detecting-missing-data" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="detecting-missing-data"><span class="header-section-number">5</span> Detecting Missing Data</h2>
<p>Before applying any imputation or modeling technique, it is essential to explore the <strong>extent and structure of missingness</strong> in the dataset. The <code>nhanes_sub</code> data frame, derived from the NHANES dataset, will be used for illustration.</p>
<section id="simple-counts-and-summaries" class="level3" data-number="5.1">
<h3 data-number="5.1" class="anchored" data-anchor-id="simple-counts-and-summaries"><span class="header-section-number">5.1</span> Simple Counts and Summaries</h3>
<p>The first step is to quantify how many values are missing per variable.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Count missing values for each variable</span></span>
<span id="cb14-2">nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb14-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">everything</span>(), <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(.)))) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 1 × 6
     ID   Age Gender   BMI BPSysAve Diabetes
  &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt;    &lt;int&gt;    &lt;int&gt;
1     0     0      0   366     1449      142</code></pre>
</div>
</div>
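<p>The output shows the number of missing values in each column: <code>BPSysAve</code> has the most (1,449), followed by <code>BMI</code> (366) and <code>Diabetes</code> (142), while <code>ID</code>, <code>Age</code>, and <code>Gender</code> are complete. Another quick check is to count how many <strong>complete vs.&nbsp;incomplete cases</strong> exist:</p>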
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">complete.cases</span>(nhanes_sub))       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of complete rows</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 8482</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">complete.cases</span>(nhanes_sub))      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of incomplete rows</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 1518</code></pre>
</div>
</div>
<p>Here 1,518 of the 10,000 rows (about 15.2%) are incomplete, which is the share of observations we would lose under <strong>listwise deletion</strong>.</p>
</section>
<section id="visualizing-missingness" class="level3" data-number="5.2">
<h3 data-number="5.2" class="anchored" data-anchor-id="visualizing-missingness"><span class="header-section-number">5.2</span> Visualizing Missingness</h3>
<p>Textual summaries are informative, but missing data often has <strong>patterns</strong> that are better revealed visually. Several R packages support this task:</p>
<section id="naniar" class="level4" data-number="5.2.1">
<h4 data-number="5.2.1" class="anchored" data-anchor-id="naniar"><span class="header-section-number">5.2.1</span> <code>naniar</code></h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb20-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(naniar)</span>
<span id="cb20-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb20-3"></span>
<span id="cb20-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Visualize missing values by variable</span></span>
<span id="cb20-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gg_miss_var</span>(nhanes_sub, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show_pct =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb20-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Missing Values by Variable in NHANES Subset"</span>,</span>
<span id="cb20-7">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Variables"</span>,</span>
<span id="cb20-8">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Proportion of Missing Values"</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-08-18_missing_values/index_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<ul>
<li><p>Each line-and-point marker (a lollipop) corresponds to a variable.</p></li>
<li><p>The <strong>length of the line</strong> shows how much of that variable is missing.</p></li>
<li><p>With <code>show_pct = TRUE</code>, the axis reports the percentage of missing values rather than the raw count, making it easier to compare across variables.</p></li>
<li><p>Variables with longer lines have higher missingness; in this subset <code>BPSysAve</code> stands out, followed by <code>BMI</code> and <code>Diabetes</code>.</p></li>
</ul>
</section>
<section id="vim" class="level4" data-number="5.2.2">
<h4 data-number="5.2.2" class="anchored" data-anchor-id="vim"><span class="header-section-number">5.2.2</span> <code>VIM</code></h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb21-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(VIM)</span>
<span id="cb21-2"></span>
<span id="cb21-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aggr</span>(nhanes_sub, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">numbers =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sortVar =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-08-18_missing_values/index_files/figure-html/unnamed-chunk-8-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-stdout">
<pre><code>
 Variables sorted by number of missings: 
 Variable Count
 BPSysAve  1449
      BMI   366
 Diabetes   142
       ID     0
      Age     0
   Gender     0</code></pre>
</div>
</div>
<p>This aggregated visualization shows the number of missing values per variable (counts rather than proportions, because <code>prop = FALSE</code>) and the combinations of missingness across variables.</p>
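<p>Counts of missingness combinations can also be tabulated without a plotting package. The sketch below uses a tiny made-up data frame (<code>toy</code>, with hypothetical values) so that it runs on its own; the lines after the data definition should carry over to <code>nhanes_sub</code> unchanged.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical toy data with a few NAs
toy &lt;- data.frame(
  BMI      = c(22.1, NA, 30.4, NA, 27.0),
  BPSysAve = c(120, NA, NA, 110, 130),
  Diabetes = c("No", "No", NA, "Yes", "No")
)

# For each row, list which variables are missing
pattern &lt;- apply(is.na(toy), 1,
                 function(r) paste(names(toy)[r], collapse = " + "))
pattern[pattern == ""] &lt;- "complete"

# Frequency of each missingness pattern, most common first
pattern_counts &lt;- sort(table(pattern), decreasing = TRUE)
pattern_counts</code></pre></div>
</div>
<p>Patterns in which several variables go missing together are worth a closer look, because they often point to a shared cause such as a skipped examination.</p>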
</section>
<section id="visdat" class="level4" data-number="5.2.3">
<h4 data-number="5.2.3" class="anchored" data-anchor-id="visdat"><span class="header-section-number">5.2.3</span> <code>visdat</code></h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb23-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(visdat)</span>
<span id="cb23-2"></span>
<span id="cb23-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vis_dat</span>(nhanes_sub)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-08-18_missing_values/index_files/figure-html/unnamed-chunk-9-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>This function displays the data type of each variable and overlays missingness, helping to identify whether missing values cluster in certain variable types (e.g., numeric vs.&nbsp;categorical).</p>
</section>
</section>
<section id="interpreting-the-patterns" class="level3" data-number="5.3">
<h3 data-number="5.3" class="anchored" data-anchor-id="interpreting-the-patterns"><span class="header-section-number">5.3</span> Interpreting the Patterns</h3>
<ul>
<li><strong>Random scatter of missing values</strong> across rows/columns is compatible with <strong>MCAR</strong>, but a visual impression alone cannot confirm it.</li>
<li><strong>Systematic patterns</strong> (e.g., younger participants more likely to have missing BMI, as in the age-band table earlier) hint at <strong>MAR</strong>.</li>
<li><strong>Blocks of missingness</strong> (entire variables missing for subgroups) may suggest <strong>MNAR</strong> or structural missingness, where a value was never meant to be collected for that subgroup.</li>
</ul>
</section>
</section>
<section id="handling-missing-data-methods" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="handling-missing-data-methods"><span class="header-section-number">6</span> Handling Missing Data — Methods</h2>
<p>In this section we review the main families of methods, show <strong>when</strong> each is appropriate, and demonstrate them on <code>nhanes_sub</code>. We will explicitly call out the <strong>trade-offs</strong> so readers can choose deliberately—not by habit.</p>
<hr>
<section id="deletion" class="level3" data-number="6.1">
<h3 data-number="6.1" class="anchored" data-anchor-id="deletion"><span class="header-section-number">6.1</span> Deletion</h3>
<p><strong>Listwise deletion (complete-case)</strong> removes any row that contains <em>at least one</em> missing value.<br>
<strong>Pairwise deletion</strong> uses all available pairs to compute correlations/covariances, which can later lead to <strong>non–positive-definite</strong> covariance matrices and failures in modeling.</p>
<ul>
<li><p><strong>Pros</strong></p>
<ul>
<li>Simple; widely implemented by default (often silently).</li>
<li>Unbiased <em>only</em> under <strong>MCAR</strong>.</li>
</ul></li>
<li><p><strong>Cons</strong></p>
<ul>
<li>Wastes data; reduces power.</li>
<li>Biased under <strong>MAR/MNAR</strong>; can change the sample composition.</li>
</ul></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb24-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># How many rows would we lose if we required complete cases for these variables?</span></span>
<span id="cb24-2">n_total <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(nhanes_sub)</span>
<span id="cb24-3">n_cc    <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> stats<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">complete.cases</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>()</span>
<span id="cb24-4"></span>
<span id="cb24-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cbind</span>(</span>
<span id="cb24-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">total_rows    =</span> n_total,</span>
<span id="cb24-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">complete_cases=</span> n_cc,</span>
<span id="cb24-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lost_rows     =</span> n_total <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n_cc,</span>
<span id="cb24-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lost_pct      =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>((n_total <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> n_cc) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> n_total <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb24-10">)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>     total_rows complete_cases lost_rows lost_pct
[1,]      10000           8482      1518     15.2</code></pre>
</div>
</div>
<p><strong>Interpretation:</strong> If the lost percentage is non-trivial (e.g., &gt;5–10%), listwise deletion both <strong>reduces power</strong> and <strong>risks bias</strong> unless MCAR truly holds. Here the loss is 15.2%, above that range. Pairwise deletion is <strong>not recommended</strong> for modeling because each correlation is computed on a different subset of rows, which can yield inconsistent covariance structures.</p>
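<p>The warning about pairwise deletion is easy to reproduce. The sketch below uses deliberately contrived toy data (hypothetical, chosen to make the effect extreme) in which each pair of variables is observed on a different set of rows, so the three pairwise correlations contradict one another.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Contrived toy data: x-y, y-z, and x-z are each observed on different rows
x &lt;- c(1:10, rep(NA, 10), 1:10)
y &lt;- c(1:10, 1:10, rep(NA, 10))
z &lt;- c(rep(NA, 10), 1:10, -(1:10))
toy &lt;- data.frame(x, y, z)

# Pairwise deletion: every correlation uses its own subset of rows
R_pair &lt;- cor(toy, use = "pairwise.complete.obs")
R_pair

# A valid correlation matrix has no negative eigenvalues
eig &lt;- eigen(R_pair, symmetric = TRUE)$values
eig
min(eig) &lt; 0   # TRUE means the matrix is not positive semi-definite</code></pre></div>
</div>
<p>Real data are rarely this extreme, but the mechanism is the same: no single complete dataset could produce these three correlations at once, so models that need a valid covariance matrix may fail or give nonsensical results.</p>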
</section>
<section id="simple-imputation" class="level3" data-number="6.2">
<h3 data-number="6.2" class="anchored" data-anchor-id="simple-imputation"><span class="header-section-number">6.2</span> Simple Imputation</h3>
<p><strong>Idea.</strong> Fill missing values with a single plausible value (one pass). Fast and convenient, but it <strong>underestimates uncertainty</strong> (standard errors too small) and can <strong>distort distributions</strong>.</p>
<p><strong>Typical choices</strong></p>
<ul>
<li><p><strong>Mean/Median/Mode</strong> (baselines; median is more robust to skew)</p></li>
<li><p><strong>k-Nearest Neighbors (kNN)</strong> (borrows information from similar rows)</p></li>
<li><p><strong>Hot-deck</strong> (donor-based; similar spirit to kNN)</p></li>
</ul>
<section id="median-numeric-mode-categorical-baselines" class="level4" data-number="6.2.1">
<h4 data-number="6.2.1" class="anchored" data-anchor-id="median-numeric-mode-categorical-baselines"><span class="header-section-number">6.2.1</span> Median (numeric) + Mode (categorical) baselines</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb26-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2025</span>)</span>
<span id="cb26-2"></span>
<span id="cb26-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a median-imputed BMI for illustration (only if BMI is missing)</span></span>
<span id="cb26-4">nh_med <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb26-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb26-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">BMI_med =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(BMI), stats<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">median</span>(BMI, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>), BMI)</span>
<span id="cb26-7">  )</span>
<span id="cb26-8"></span>
<span id="cb26-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Count missing BMI values before and after imputation</span></span>
<span id="cb26-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(nhanes_sub<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>BMI))           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># original missing BMI count</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 366</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb28" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb28-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(nh_med<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>BMI_med))           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># should be 0</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 0</code></pre>
</div>
</div>
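<p>The heading also mentions a mode baseline for categorical variables, which the chunk above does not show. A minimal sketch follows, assuming a hypothetical helper <code>impute_mode()</code> (written here for illustration, not taken from any package) and a made-up factor:</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical helper: replace NAs with the most frequent observed level
impute_mode &lt;- function(x) {
  counts &lt;- table(x)                              # table() ignores NA by default
  mode_value &lt;- names(counts)[which.max(counts)]  # ties: the first level wins
  x[is.na(x)] &lt;- mode_value
  x
}

# Made-up example values (not NHANES)
diabetes_toy &lt;- factor(c("No", "No", NA, "Yes", "No", NA))
diabetes_filled &lt;- impute_mode(diabetes_toy)

table(diabetes_toy, useNA = "ifany")
table(diabetes_filled, useNA = "ifany")</code></pre></div>
</div>
<p>Like the median for numeric variables, this inflates the most common category and should be treated as a baseline only.</p>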
<p><strong>Distribution distortion (variance shrinkage).</strong></p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb30-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb30-2"></span>
<span id="cb30-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Compare BMI distribution: complete-case vs median-imputed</span></span>
<span id="cb30-4">p_cc  <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb30-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(BMI)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb30-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> BMI)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb30-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_density</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb30-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BMI density — complete cases"</span>)</span>
<span id="cb30-9"></span>
<span id="cb30-10">p_med <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nh_med <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb30-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> BMI_med)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb30-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_density</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb30-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BMI density — median-imputed"</span>)</span>
<span id="cb30-14"></span>
<span id="cb30-15">p_cc; p_med</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-08-18_missing_values/index_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-08-18_missing_values/index_files/figure-html/unnamed-chunk-12-2.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>Interpretation:</strong> Median imputation <strong>spikes</strong> the distribution around the median and <strong>reduces variance</strong>. This can attenuate real relationships that depend on dispersion.</p>
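<p>The shrinkage can also be checked numerically. The sketch below uses simulated values rather than NHANES (the vectors <code>bmi_true</code>, <code>bmi_obs</code>, and <code>bmi_med</code> are illustrative names) and compares the standard deviation before and after median imputation:</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(2025)

# Simulated "BMI-like" values; 20% are then set to NA completely at random
bmi_true &lt;- rnorm(1000, mean = 27, sd = 6)
bmi_obs &lt;- bmi_true
bmi_obs[sample(1000, 200)] &lt;- NA

# Median imputation, as in the chunk above
bmi_med &lt;- ifelse(is.na(bmi_obs), median(bmi_obs, na.rm = TRUE), bmi_obs)

# Standard deviation of the observed values vs. the median-imputed vector
sd_observed &lt;- sd(bmi_obs, na.rm = TRUE)
sd_imputed  &lt;- sd(bmi_med)

c(observed = sd_observed, median_imputed = sd_imputed)</code></pre></div>
</div>
<p>With 20% of the values replaced by a single constant near the center, the imputed standard deviation should come out smaller than the observed one, roughly by a factor of the square root of 0.8.</p>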
</section>
<section id="knn-donor-based-imputation" class="level4" data-number="6.2.2">
<h4 data-number="6.2.2" class="anchored" data-anchor-id="knn-donor-based-imputation"><span class="header-section-number">6.2.2</span> kNN (donor-based) imputation</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb31-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># kNN imputation with VIM::kNN (works on data frames; chooses donors by similarity)</span></span>
<span id="cb31-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(VIM)</span>
<span id="cb31-3"></span>
<span id="cb31-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># kNN() imputes every column with NAs by default (pass variable = "BMI" to restrict it); k = 5 is a reasonable starting point.</span></span>
<span id="cb31-5">nh_knn <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb31-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(Age, Gender, BMI, BPSysAve, Diabetes) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb31-7">  VIM<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">kNN</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">k =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">imp_var =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># imp_var=FALSE avoids extra *_imp columns</span></span>
<span id="cb31-8"></span>
<span id="cb31-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check imputation effect</span></span>
<span id="cb31-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(nhanes_sub<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>BMI))   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># original missing BMI</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 366</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb33-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(nh_knn<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>BMI))       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># after kNN (should be 0)</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 0</code></pre>
</div>
</div>
<p><strong>Interpretation:</strong> kNN preserves local structure better than mean/median, but it is still <strong>single imputation</strong> → uncertainty is <strong>not</strong> propagated. Choice of <strong>k</strong> and included predictors matters.</p>
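<p>One way to see how much <strong>k</strong> matters is to hide values that are actually known, impute them, and compare. The sketch below assumes the <code>VIM</code> package is installed and uses simulated data (the data frame <code>toy</code> and its columns are made up); it is an illustration of the idea, not a tuning recipe:</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(VIM)
set.seed(1)

# Simulated data in which bmi depends on age and sbp
n &lt;- 200
toy &lt;- data.frame(age = round(runif(n, 20, 70)))
toy$sbp &lt;- 100 + 0.5 * toy$age + rnorm(n, sd = 8)
toy$bmi &lt;- 18 + 0.1 * toy$age + 0.05 * toy$sbp + rnorm(n, sd = 2)

# Hide 40 known bmi values so the imputations can be scored
masked &lt;- sample(n, 40)
toy_na &lt;- toy
toy_na$bmi[masked] &lt;- NA

# Impute only bmi, using age and sbp to measure similarity, for several k
k_values &lt;- c(1, 5, 15)
rmse_by_k &lt;- sapply(k_values, function(k) {
  filled &lt;- kNN(toy_na, variable = "bmi", dist_var = c("age", "sbp"),
                k = k, imp_var = FALSE)
  sqrt(mean((filled$bmi[masked] - toy$bmi[masked])^2))
})
names(rmse_by_k) &lt;- paste0("k=", k_values)

rmse_by_k</code></pre></div>
</div>
<p>On real data the true values are unknown, so the same masking trick has to be applied to a subset of the observed entries.</p>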
<blockquote class="blockquote">
<p><strong>Rule of thumb.</strong> Simple methods are acceptable for quick EDA or as baselines. For principled inference under MAR, prefer <strong>Multiple Imputation</strong>.</p>
</blockquote>
</section>
</section>
<section id="advanced-methods" class="level3" data-number="6.3">
<h3 data-number="6.3" class="anchored" data-anchor-id="advanced-methods"><span class="header-section-number">6.3</span> Advanced Methods</h3>
<section id="multiple-imputation-with-mice" class="level4" data-number="6.3.1">
<h4 data-number="6.3.1" class="anchored" data-anchor-id="multiple-imputation-with-mice"><span class="header-section-number">6.3.1</span> Multiple Imputation with <code>mice</code></h4>
<p>So far, we have seen that missing values exist in several variables of our dataset. A common and powerful approach to handle missingness is <strong>Multiple Imputation by Chained Equations (MICE)</strong>. The <code>mice</code> package in R is widely used for this purpose. The idea is simple:</p>
<ul>
<li>Instead of filling in missing values once, MICE creates <strong>multiple complete datasets</strong> by imputing values several times.</li>
<li>Each dataset is then analyzed separately.</li>
<li>Finally, results are pooled together to account for the variability introduced by missingness.</li>
</ul>
<p>Let’s try this approach on our subset of the <code>NHANES</code> data:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb35-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(mice)</span>
<span id="cb35-2"></span>
<span id="cb35-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create imputations</span></span>
<span id="cb35-4">imp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mice</span>(nhanes_sub, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">m =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">seed =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
 iter imp variable
  1   1  BMI  BPSysAve  Diabetes
  1   2  BMI  BPSysAve  Diabetes
  1   3  BMI  BPSysAve  Diabetes
  2   1  BMI  BPSysAve  Diabetes
  2   2  BMI  BPSysAve  Diabetes
  2   3  BMI  BPSysAve  Diabetes
  3   1  BMI  BPSysAve  Diabetes
  3   2  BMI  BPSysAve  Diabetes
  3   3  BMI  BPSysAve  Diabetes
  4   1  BMI  BPSysAve  Diabetes
  4   2  BMI  BPSysAve  Diabetes
  4   3  BMI  BPSysAve  Diabetes
  5   1  BMI  BPSysAve  Diabetes
  5   2  BMI  BPSysAve  Diabetes
  5   3  BMI  BPSysAve  Diabetes</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb37-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Look at a summary</span></span>
<span id="cb37-2">imp</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Class: mids
Number of multiple imputations:  3 
Imputation methods:
      ID      Age   Gender      BMI BPSysAve Diabetes 
      ""       ""       ""    "pmm"    "pmm" "logreg" 
PredictorMatrix:
         ID Age Gender BMI BPSysAve Diabetes
ID        0   1      1   1        1        1
Age       1   0      1   1        1        1
Gender    1   1      0   1        1        1
BMI       1   1      1   0        1        1
BPSysAve  1   1      1   1        0        1
Diabetes  1   1      1   1        1        0</code></pre>
</div>
</div>
<p>The output shows:</p>
<ul>
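<li><p><strong>Iteration log:</strong> five iterations (the default <code>maxit = 5</code>) for each of the <code>m = 3</code> imputed datasets.</p></li>
<li><p><strong>Imputation methods:</strong> the method used for each variable with missingness (<code>pmm</code> for <code>BMI</code> and <code>BPSysAve</code>, <code>logreg</code> for <code>Diabetes</code>); complete variables show <code>""</code>.</p></li>
<li><p><strong>PredictorMatrix:</strong> a 1 in row <em>i</em>, column <em>j</em> means variable <em>j</em> is used to predict variable <em>i</em>. By default every other variable is a predictor, including <code>ID</code>, which is rarely sensible for a pure identifier.</p></li>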
</ul>
<p>We can take a quick look at the imputed values:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb39-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Inspect first few imputations for BMI</span></span>
<span id="cb39-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(imp<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>imp<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>BMI)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>        1     2     3
61  43.00 17.70 12.90
161 16.70 13.50 18.59
210 16.28 24.00 23.10
309 24.00 37.32 26.20
310 24.00 28.55 26.30
320 25.10 32.25 29.20</code></pre>
</div>
</div>
<p>This shows the plausible values drawn for missing BMI observations in each of the three imputed datasets. The values differ across datasets, sometimes substantially (row 61 ranges from 12.90 to 43.00); this spread is expected and is what lets the analysis reflect imputation uncertainty.</p>
<p>Once we have these imputations, we can <strong>complete the dataset</strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb41-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extract the first imputed dataset</span></span>
<span id="cb41-2">nhanes_completed <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">complete</span>(imp, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb41-3"></span>
<span id="cb41-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(nhanes_completed)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>     ID Age Gender   BMI BPSysAve Diabetes
1 51624  34   male 32.22      113       No
2 51624  34   male 32.22      113       No
3 51624  34   male 32.22      113       No
4 51625   4   male 15.30       92       No
5 51630  49 female 30.57      112       No
6 51638   9   male 16.82       86       No</code></pre>
</div>
</div>
<p>Now we have one complete dataset with no missing values. Analyzing only this dataset would amount to single imputation; in practice, we fit the model to every imputed dataset and combine the results using Rubin’s rules. The key takeaways here are:</p>
<ul>
<li><p><code>mice()</code> provides multiple versions of the data,</p></li>
<li><p>imputations are based on relationships among variables,</p></li>
<li><p>and the method preserves uncertainty rather than hiding it.</p></li>
</ul>
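<p>Rubin’s rules themselves are short enough to compute by hand. The sketch below uses made-up estimates and squared standard errors for one coefficient from <code>m = 3</code> analyses (the numbers are illustrative, not results from the NHANES model):</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Made-up inputs: one coefficient estimated on each of m = 3 imputed datasets
est &lt;- c(0.10, 0.12, 0.08)          # point estimates
se2 &lt;- c(0.0004, 0.0005, 0.0003)    # squared standard errors
m   &lt;- length(est)

pooled_est  &lt;- mean(est)            # pooled point estimate
within_var  &lt;- mean(se2)            # average within-imputation variance
between_var &lt;- var(est)             # variance of the estimates across imputations

# Total variance: within + between, with a correction for finite m
total_var &lt;- within_var + (1 + 1 / m) * between_var
pooled_se &lt;- sqrt(total_var)

c(estimate = pooled_est, se = pooled_se)</code></pre></div>
</div>
<p>The between-imputation term is what single imputation leaves out, which is why its standard errors are too small. In practice <code>mice::pool()</code> does this pooling for you and also handles the degrees of freedom.</p>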
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>MICE Essentials: Key Arguments
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li><p><strong>method</strong>: Specifies the imputation model for each variable.</p>
<ul>
<li><code>pmm</code>: predictive mean matching (continuous variables)</li>
<li><code>logreg</code>: logistic regression (binary)</li>
<li><code>polyreg</code>: multinomial regression (nominal categorical)</li>
<li><code>polr</code>: proportional odds model (ordered categorical)</li>
</ul>
<p><em>Rule of thumb</em>: If a factor has &gt;2 levels, prefer <code>polyreg</code> (nominal) or <code>polr</code> (ordered) instead of <code>logreg</code>. Always check the actual levels of variables such as <code>Gender</code> or <code>Diabetes</code> in your data before setting methods.</p></li>
<li><p><strong>predictorMatrix</strong>: Controls which variables are used to predict others.</p>
<ul>
<li>Rows = target variables (to be imputed)</li>
<li>Columns = predictor variables</li>
</ul></li>
<li><p><strong>m</strong>: Number of multiple imputations to generate (commonly 5–20).</p>
<ul>
<li>More imputations recommended for high missingness.</li>
</ul></li>
<li><p><strong>maxit</strong>: Number of iterations of the chained equations (often 5–10).</p></li>
<li><p><strong>seed</strong>: Random seed for reproducibility.</p>
<ul>
<li>Always set when writing tutorials or reports.</li>
</ul></li>
</ul>
</div>
</div>
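<p>As a sketch of how these arguments fit together, the chunk below builds the default method vector and predictor matrix, then stops an identifier column from being used as a predictor. It assumes the <code>mice</code> package is installed and runs on simulated data (the data frame <code>toy</code> and its columns are hypothetical, chosen only to mirror the structure of <code>nhanes_sub</code>):</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(mice)
set.seed(42)

# Simulated data: an identifier, a complete numeric column, and two columns with NAs
n &lt;- 100
toy &lt;- data.frame(
  id     = seq_len(n),
  age    = round(runif(n, 20, 70)),
  bmi    = rnorm(n, mean = 27, sd = 4),
  smoker = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$bmi[sample(n, 15)] &lt;- NA
toy$smoker[sample(n, 10)] &lt;- NA

# Start from the defaults mice would choose, then adjust them
meth &lt;- make.method(toy)            # "" for complete columns, "pmm" / "logreg" otherwise
pred &lt;- make.predictorMatrix(toy)   # rows = targets, columns = predictors
pred[, "id"] &lt;- 0                   # do not use the identifier to predict anything

imp_toy &lt;- mice(
  toy,
  m = 5,
  maxit = 5,
  method = meth,
  predictorMatrix = pred,
  seed = 123,
  printFlag = FALSE
)

meth
pred</code></pre></div>
</div>
<p>The same pattern applies to <code>nhanes_sub</code>: inspect the defaults first, then zero out the <code>ID</code> column of the predictor matrix before calling <code>mice()</code>.</p>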
</section>
<section id="multiple-imputation-with-missforest" class="level4" data-number="6.3.2">
<h4 data-number="6.3.2" class="anchored" data-anchor-id="multiple-imputation-with-missforest"><span class="header-section-number">6.3.2</span> Random Forest Imputation with missForest</h4>
<p>The <strong>missForest</strong> package provides a non-parametric imputation method based on random forests.<br>
Unlike <code>mice</code>, which generates multiple imputations, <code>missForest</code> creates a <strong>single completed dataset</strong> by iteratively predicting missing values using random forest models. It works well with both continuous and categorical variables and can capture nonlinear relationships.</p>
<p>We will use the same <code>nhanes_sub</code> dataset as before:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb43" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb43-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb43-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(missForest)</span>
<span id="cb43-3"></span>
<span id="cb43-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Start from the existing subset:</span></span>
<span id="cb43-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># nhanes_sub &lt;- NHANES |&gt; select(ID, Age, Gender, BMI, BPSysAve, Diabetes)</span></span>
<span id="cb43-6"></span>
<span id="cb43-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 1) Keep only model-relevant columns (drop pure identifier)</span></span>
<span id="cb43-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 2) Convert character variables to factors (missForest expects factors, not raw character)</span></span>
<span id="cb43-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 3) Coerce to base data.frame to avoid tibble-related method dispatch issues</span></span>
<span id="cb43-10">mf_input <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb43-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(Age, Gender, BMI, BPSysAve, Diabetes) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb43-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">where</span>(is.character), as.factor)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb43-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.data.frame</span>()</span>
<span id="cb43-14"></span>
<span id="cb43-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb43-16">mf_fit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">missForest</span>(</span>
<span id="cb43-17">  mf_input,</span>
<span id="cb43-18">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ntree   =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>,    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># more trees -&gt; stabler imputations</span></span>
<span id="cb43-19">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">maxiter =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># outer iterations (default 10; 5 is fine for demo)</span></span>
<span id="cb43-20">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">verbose =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb43-21">)</span>
<span id="cb43-22"></span>
<span id="cb43-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Completed data and OOB error</span></span>
<span id="cb43-24">mf_imputed <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mf_fit<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>ximp</span>
<span id="cb43-25">mf_oob     <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mf_fit<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>OOBerror</span>
<span id="cb43-26"></span>
<span id="cb43-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Quick checks</span></span>
<span id="cb43-28"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(mf_input<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>BMI))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 366</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb45-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(mf_imputed<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>BMI))   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># should go to 0</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 0</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb47-1">mf_oob</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>     NRMSE        PFC 
0.17890307 0.02667884 </code></pre>
</div>
</div>
<p>The function returns a list with two key elements:</p>
<ul>
<li><p><code>ximp</code>: the completed dataset after imputation.</p></li>
<li><p><code>OOBerror</code>: the estimated imputation error (normalized root mean squared error for continuous variables and proportion of falsely classified entries for categorical variables).</p></li>
</ul>
<p><strong>Interpretation:</strong></p>
<ul>
<li><p>The <strong>completed dataset</strong> (<code>ximp</code>) replaces all missing values with imputed estimates.</p></li>
<li><p><strong>NRMSE (Normalized Root Mean Squared Error):</strong> <code>0.1789</code></p>
<ul>
<li>This value is the out-of-bag (OOB) estimate of imputation error for continuous variables. Only <code>BMI</code> and <code>BPSysAve</code> have missing values here; <code>Age</code> is complete.</li>
<li>Since it is normalized, values closer to <strong>0</strong> indicate better accuracy. An error of ~0.18 suggests reasonably accurate imputations, but it is an estimate derived from the observed values, not a comparison against the unknown true values.</li>
</ul></li>
<li><p><strong>PFC (Proportion of Falsely Classified):</strong> <code>0.0267</code></p>
<ul>
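<li>This metric evaluates categorical variables. Only <code>Diabetes</code> has missing values here; <code>Gender</code> is complete.</li>
<li>A value of ~0.027 corresponds to an estimated out-of-bag misclassification rate of about <strong>2.7%</strong>. When one class dominates, as is common for a diagnosis variable, a low PFC is easier to achieve and should be read with that in mind.</li>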
</ul></li>
</ul>
<p>✅ <strong>Summary:</strong><br>
These OOB estimates suggest that <code>missForest</code> imputed the continuous variables with relatively low error and the categorical variable with very low misclassification. They are internal estimates rather than proof that the completed data match the true distribution, and because only one completed dataset is produced, standard errors from downstream models will still be too small.</p>
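<p>When the true values are known, the OOB estimate can be compared with the actual imputation error. The sketch below follows the pattern in the <code>missForest</code> documentation: it removes values from the complete built-in <code>iris</code> data with <code>prodNA()</code> and passes the original data as <code>xtrue</code>. The choice of <code>iris</code> and of 10% missingness is an assumption made for illustration:</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(missForest)
set.seed(81)

# Remove 10% of the entries completely at random from a complete dataset
iris_mis &lt;- prodNA(iris, noNA = 0.1)

# Supplying xtrue makes missForest also report the true imputation error
fit &lt;- missForest(iris_mis, xtrue = iris, ntree = 100, verbose = FALSE)

fit$OOBerror   # estimate available in real applications
fit$error      # true error, only available because the original data are known</code></pre></div>
</div>
<p>If the two sets of numbers are close, that supports using <code>OOBerror</code> as a stand-in on data such as <code>nhanes_sub</code>, where the true values cannot be checked.</p>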
<p><strong>Pros and Cons of <code>missForest</code></strong></p>
<p><strong>Advantages:</strong></p>
<ul>
<li>Handles mixed data types (continuous + categorical).</li>
<li>Captures nonlinearities and complex interactions.</li>
<li>No need to specify an explicit imputation model.</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>Produces only a <strong>single imputed dataset</strong>, so uncertainty is not directly quantified (unlike <code>mice</code>).</li>
<li>Computationally more expensive for very large datasets.</li>
</ul>
</section>
</section>
</section>
<section id="single-vs.-multiple-imputation" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="single-vs.-multiple-imputation"><span class="header-section-number">7</span> Single vs.&nbsp;Multiple Imputation</h2>
<p>One critical distinction in handling missing data is <strong>single imputation</strong> vs.&nbsp;<strong>multiple imputation (MI)</strong>.</p>
<ul>
<li><p><strong>Single imputation</strong> (mean, median, regression, etc.) fills each missing value once. While simple, it <strong>ignores uncertainty</strong>, treating imputed values as if they were observed.</p></li>
<li><p><strong>Multiple imputation</strong> generates <strong>several plausible versions of the dataset</strong> (commonly 5–20). Each dataset is analyzed separately, and the results are then pooled using Rubin’s rules. This approach accounts for <strong>variability due to missingness</strong> and produces more reliable inferences.</p></li>
</ul>
<p>Let’s illustrate with our <code>nhanes_sub</code> dataset:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb49" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb49-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Complete-case analysis (drops every row with a missing value)</span></span>
<span id="cb49-2">lm_cc <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(BMI <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> Age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Gender <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BPSysAve <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Diabetes,</span>
<span id="cb49-3">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> nhanes_sub, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.action =</span> na.omit)</span>
<span id="cb49-4"></span>
<span id="cb49-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Single imputation (mean imputation for BMI)</span></span>
<span id="cb49-6">nhanes_single <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nhanes_sub <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb49-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">BMI =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(BMI), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(BMI, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>), BMI))</span>
<span id="cb49-8"></span>
<span id="cb49-9">lm_si <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(BMI <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> Age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Gender <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BPSysAve <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Diabetes,</span>
<span id="cb49-10">            <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> nhanes_single)</span>
<span id="cb49-11"></span>
<span id="cb49-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Multiple imputation with mice</span></span>
<span id="cb49-13">imp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mice</span>(nhanes_sub, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">m =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pmm"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">seed =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
 iter imp variable
  1   1  BMI  BPSysAve  Diabetes
  1   2  BMI  BPSysAve  Diabetes
  1   3  BMI  BPSysAve  Diabetes
  1   4  BMI  BPSysAve  Diabetes
  1   5  BMI  BPSysAve  Diabetes
  2   1  BMI  BPSysAve  Diabetes
  2   2  BMI  BPSysAve  Diabetes
  2   3  BMI  BPSysAve  Diabetes
  2   4  BMI  BPSysAve  Diabetes
  2   5  BMI  BPSysAve  Diabetes
  3   1  BMI  BPSysAve  Diabetes
  3   2  BMI  BPSysAve  Diabetes
  3   3  BMI  BPSysAve  Diabetes
  3   4  BMI  BPSysAve  Diabetes
  3   5  BMI  BPSysAve  Diabetes
  4   1  BMI  BPSysAve  Diabetes
  4   2  BMI  BPSysAve  Diabetes
  4   3  BMI  BPSysAve  Diabetes
  4   4  BMI  BPSysAve  Diabetes
  4   5  BMI  BPSysAve  Diabetes
  5   1  BMI  BPSysAve  Diabetes
  5   2  BMI  BPSysAve  Diabetes
  5   3  BMI  BPSysAve  Diabetes
  5   4  BMI  BPSysAve  Diabetes
  5   5  BMI  BPSysAve  Diabetes</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb51" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb51-1">lm_mi <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span>(imp, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(BMI <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> Age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Gender <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> BPSysAve <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> Diabetes))</span>
<span id="cb51-2">pooled <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pool</span>(lm_mi)</span></code></pre></div></div>
</div>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb52" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb52-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(lm_cc)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = BMI ~ Age + Gender + BPSysAve + Diabetes, data = nhanes_sub, 
    na.action = na.omit)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.771  -4.640  -1.053   3.553  53.003 

Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 17.488597   0.513225  34.076   &lt;2e-16 ***
Age          0.047930   0.004273  11.217   &lt;2e-16 ***
Gendermale  -0.278756   0.144799  -1.925   0.0542 .  
BPSysAve     0.067504   0.004903  13.767   &lt;2e-16 ***
DiabetesYes  3.789606   0.265240  14.287   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.588 on 8477 degrees of freedom
  (1518 observations deleted due to missingness)
Multiple R-squared:  0.1136,    Adjusted R-squared:  0.1132 
F-statistic: 271.5 on 4 and 8477 DF,  p-value: &lt; 2.2e-16</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb54" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb54-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(lm_si)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = BMI ~ Age + Gender + BPSysAve + Diabetes, data = nhanes_single)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.657  -4.634  -1.029   3.533  53.004 

Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 17.600406   0.508667  34.601   &lt;2e-16 ***
Age          0.047159   0.004237  11.129   &lt;2e-16 ***
Gendermale  -0.287221   0.143820  -1.997   0.0458 *  
BPSysAve     0.066791   0.004858  13.748   &lt;2e-16 ***
DiabetesYes  3.725941   0.262781  14.179   &lt;2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.57 on 8541 degrees of freedom
  (1454 observations deleted due to missingness)
Multiple R-squared:  0.1118,    Adjusted R-squared:  0.1114 
F-statistic: 268.8 on 4 and 8541 DF,  p-value: &lt; 2.2e-16</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb56" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb56-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(pooled)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>         term    estimate   std.error statistic       df       p.value
1 (Intercept) 14.66559186 0.487686304 30.071773 1253.210 5.075329e-150
2         Age  0.10367759 0.003748806 27.656164 3916.798 6.073157e-154
3  Gendermale -0.30521870 0.134707691 -2.265785 3727.382  2.352165e-02
4    BPSysAve  0.06754332 0.004761820 14.184349 1160.000  3.121286e-42
5 DiabetesYes  3.23549109 0.260299313 12.429887 8866.664  3.529334e-35</code></pre>
</div>
</div>
<p>We applied three different approaches to handle missing BMI values in the <code>nhanes_sub</code> dataset, modeling <strong>BMI ~ Age + Gender + BPSysAve + Diabetes</strong>. Here is what we found:</p>
<p><strong>1. Complete-Case Analysis (CCA)</strong></p>
<ul>
<li><p><strong>What we did:</strong> We dropped all observations with missing values (<code>na.omit</code>).</p></li>
<li><p><strong>Result:</strong></p>
<ul>
<li><p><strong>Coefficients:</strong> Age (0.048), BPSysAve (0.068), DiabetesYes (+3.79), Gender slightly negative.</p></li>
<li><p><strong>Standard errors:</strong> Slightly larger than in the other two fits, because 1,518 incomplete rows were discarded.</p></li>
<li><p><strong>R²:</strong> 0.114 — fairly low.</p></li>
</ul></li>
<li><p><strong>Takeaway:</strong> CCA wastes data and may bias estimates if missingness is not MCAR (Missing Completely at Random).</p></li>
</ul>
<p><strong>2. Single Imputation (Mean Substitution for BMI)</strong></p>
<ul>
<li><p><strong>What we did:</strong> Replaced missing BMI values with the mean BMI.</p></li>
<li><p><strong>Result:</strong></p>
<ul>
<li><p><strong>Coefficients:</strong> Very close to CCA (Age 0.047, BPSysAve 0.067, DiabetesYes +3.73).</p></li>
<li><p><strong>Gender</strong> effect became just significant (<em>p</em> = 0.046).</p></li>
<li><p><strong>Residual SE</strong> decreased slightly (6.57).</p></li>
</ul></li>
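<li><p><strong>Takeaway:</strong> Looks “better” because rows with a missing BMI are retained, but only 64 rows are actually recovered: the 1,454 rows still missing <code>BPSysAve</code> or <code>Diabetes</code> are dropped by <code>lm()</code> as before. More importantly, this approach <strong>ignores imputation uncertainty</strong> and artificially stabilizes estimates. Standard errors are underestimated, leading to overconfidence.</p></li>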
</ul>
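<p>The variance shrinkage behind this overconfidence is easy to reproduce on simulated data. The sketch below is illustrative only: the vector <code>x</code>, its size, and the 30% missingness rate are assumptions, not values taken from <code>nhanes_sub</code>.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(123)
x &lt;- rnorm(200, mean = 27, sd = 6)      # hypothetical BMI-like variable
x[sample(200, 60)] &lt;- NA                # make 30% of the values missing

x_obs &lt;- x[!is.na(x)]                               # observed values only
x_imp &lt;- ifelse(is.na(x), mean(x, na.rm = TRUE), x) # mean imputation

sd(x_obs)   # spread of the observed values
sd(x_imp)   # smaller: imputed points add no deviation from the mean</code></pre></div>
</div>
<p>Every imputed value sits exactly at the observed mean, so the sum of squared deviations stays the same while the sample size grows. The standard deviation, and every standard error derived from it, shrinks as a result.</p>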
<p><strong>3. Multiple Imputation (MI with <code>mice</code>, m = 5, method = “pmm”)</strong></p>
<ul>
<li><p><strong>What we did:</strong> Generated 5 imputed datasets using Predictive Mean Matching (PMM), fit the same model in each, and pooled results.</p></li>
<li><p><strong>Result:</strong></p>
<ul>
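<li><p><strong>Coefficients:</strong> Age effect roughly doubled (0.104 vs.&nbsp;0.048), intercept dropped (14.7 vs.&nbsp;~17.5), Diabetes effect slightly smaller (+3.24), Gender effect stayed modest and is now significant at the 5% level (<em>p</em> = 0.023).</p></li>
<li><p><strong>Standard errors:</strong> Pooled across the five fits, so they include the between-imputation variance and reflect real uncertainty in the imputed values. They are not larger than the CCA standard errors here, because MI also uses every row instead of discarding incomplete ones.</p></li>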
<li><p><strong>Inference:</strong> Despite differences in point estimates, the conclusions are more <strong>statistically honest</strong>.</p></li>
</ul></li>
<li><p><strong>Takeaway:</strong> MI balances efficiency (uses all data) and validity (acknowledges missingness uncertainty).</p></li>
</ul>
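<p>To make the pooling step less of a black box, here is a by-hand sketch of Rubin’s rules (Rubin 1987) for a single coefficient. The five estimates and standard errors are hypothetical numbers chosen for illustration; they are not the per-imputation results from the model above.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical results for one coefficient across m = 5 imputed datasets
est &lt;- c(0.101, 0.106, 0.103, 0.105, 0.104)       # point estimates
se  &lt;- c(0.0036, 0.0037, 0.0036, 0.0038, 0.0037)  # standard errors
m   &lt;- length(est)

q_bar &lt;- mean(est)        # pooled point estimate
u_bar &lt;- mean(se^2)       # within-imputation variance
b     &lt;- var(est)         # between-imputation variance
t_var &lt;- u_bar + (1 + 1 / m) * b   # total variance

pooled_se &lt;- sqrt(t_var)
c(estimate = q_bar, naive_se = sqrt(u_bar), pooled_se = pooled_se)</code></pre></div>
</div>
<p>The pooled standard error is always at least as large as the average within-imputation standard error, because the between-imputation term can only add variance. That extra term is what single imputation leaves out.</p>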
<p><strong>🔑 Overall Comparison</strong></p>
<table class="caption-top table">
<colgroup>
<col style="width: 19%">
<col style="width: 16%">
<col style="width: 22%">
<col style="width: 22%">
<col style="width: 19%">
</colgroup>
<thead>
<tr class="header">
<th>Method</th>
<th>Keeps All Data</th>
<th>Coefficients Similar?</th>
<th>SE Adjusted for Uncertainty?</th>
<th>Main Issue</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Complete Case (CCA)</td>
<td>❌ (1,518 rows lost)</td>
<td>Yes, but less precise</td>
<td>✅ (but biased if MAR/MNAR)</td>
<td>Data loss, possible bias</td>
</tr>
<tr class="even">
<td>Single Imputation (SI)</td>
<td>Partly (only BMI was imputed here; 1,454 rows still dropped)</td>
<td>Similar to CCA</td>
<td>❌ Underestimated</td>
<td>Overconfident inference</td>
</tr>
<tr class="odd">
<td>Multiple Imputation (MI)</td>
<td>✅</td>
<td>Somewhat different (esp.&nbsp;Age)</td>
<td>✅ Properly adjusted</td>
<td>More computation needed</td>
</tr>
</tbody>
</table>
<p><strong>Interpretation:</strong></p>
<ul>
<li><p><strong>Complete-case</strong> drops too much data and risks bias.</p></li>
<li><p><strong>Single imputation</strong> keeps the data but gives <em>too much confidence</em> in results.</p></li>
<li><p><strong>Multiple imputation</strong> changes some coefficients (notably Age) and reports more realistic uncertainty.</p></li>
</ul>
<p>👉 <strong>Lesson:</strong> If your goal is valid inference, especially in epidemiological or social science settings, <strong>multiple imputation is the gold standard</strong>.</p>
</section>
<section id="comparison-of-common-imputation-methods" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="comparison-of-common-imputation-methods"><span class="header-section-number">8</span> Comparison of Common Imputation Methods</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 22%">
<col style="width: 31%">
<col style="width: 23%">
<col style="width: 22%">
</colgroup>
<thead>
<tr class="header">
<th>Method</th>
<th>Description</th>
<th>Advantages</th>
<th>Disadvantages</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Listwise Deletion</td>
<td>Removes all observations containing missing values</td>
<td>Very simple, quick to implement</td>
<td>Substantial data loss, potential bias</td>
</tr>
<tr class="even">
<td>Mean / Median / Mode</td>
<td>Replaces missing values with a fixed statistic</td>
<td>Easy to apply, preserves sample size</td>
<td>Reduces variance, distorts relationships</td>
</tr>
<tr class="odd">
<td>LOCF (Last Observation Carried Forward)</td>
<td>Uses the last available value (mainly time series)</td>
<td>Useful in longitudinal data, preserves continuity</td>
<td>Ignores trends, underestimates variability</td>
</tr>
<tr class="even">
<td>Linear Interpolation</td>
<td>Estimates missing values by connecting known data points</td>
<td>Maintains trends, intuitive</td>
<td>Fails with sudden changes or nonlinear patterns</td>
</tr>
<tr class="odd">
<td>KNN Imputation</td>
<td>Predicts missing values using nearest neighbors</td>
<td>Preserves multivariate structure, flexible</td>
<td>Computationally expensive, sensitive to k choice</td>
</tr>
<tr class="even">
<td>MICE (Multiple Imputation by Chained Equations)</td>
<td>Iterative regression-based multiple imputation</td>
<td>Accounts for uncertainty, widely used in research</td>
<td>Time-consuming, requires expertise</td>
</tr>
<tr class="odd">
<td>missForest</td>
<td>Uses Random Forest to impute missing values</td>
<td>Handles nonlinearities and interactions</td>
<td>Black-box method, computationally intensive</td>
</tr>
<tr class="even">
<td>EM Algorithm</td>
<td>Iterative expectation-maximization for likelihood-based estimation</td>
<td>Statistically principled, robust in theory</td>
<td>Requires strong assumptions, advanced knowledge</td>
</tr>
</tbody>
</table>
<p>No imputation method is universally optimal; each trades off simplicity, accuracy, and interpretability. For instance, <strong>listwise deletion</strong> is tempting for its ease but can bias results when missingness is not completely at random. Simple <strong>mean or median imputation</strong> keeps the dataset intact but artificially reduces variability and attenuates correlations. More advanced techniques such as <strong>MICE, missForest, and EM</strong> better preserve relationships among variables, and MICE additionally propagates imputation uncertainty through pooling, but they demand more computational resources and methodological expertise.</p>
<p>In practice:</p>
<ul>
<li><p><strong>Exploratory analysis</strong> often starts with simple methods (e.g., median replacement) to get a sense of the data.</p></li>
<li><p><strong>Time series data</strong> may rely on <strong>LOCF or interpolation</strong>.</p></li>
<li><p><strong>Complex survey or clinical datasets</strong> typically benefit from advanced approaches like <strong>MICE</strong> or <strong>missForest</strong>, which better respect the multivariate nature of the data.</p></li>
</ul>
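<p>For the time series case, both LOCF and linear interpolation can be sketched in base R. The series <code>y</code> and the helper <code>locf()</code> below are hypothetical names introduced for illustration; packages such as <code>zoo</code> offer ready-made equivalents.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">y &lt;- c(10, NA, NA, 16, NA, 20)   # toy series with gaps

# LOCF: carry the most recent observed value forward
locf &lt;- function(x) {
  idx &lt;- cumsum(!is.na(x))       # position of the latest observed value
  idx[idx == 0] &lt;- NA            # leading NAs have nothing to carry forward
  x[!is.na(x)][idx]
}

# Linear interpolation between the observed points
obs &lt;- !is.na(y)
y_lin &lt;- approx(which(obs), y[obs], xout = seq_along(y))$y

locf(y)   # 10 10 10 16 16 20
y_lin     # 10 12 14 16 18 20</code></pre></div>
</div>
<p>LOCF produces flat steps, while interpolation follows the local trend. Neither adds any noise, so both understate variability in the filled-in stretches.</p>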
<p>Ultimately, the choice depends on the <strong>data structure, missingness mechanism (MCAR, MAR, MNAR), and analytical goals</strong>.</p>
</section>
<section id="conclusion" class="level2" data-number="9">
<h2 data-number="9" class="anchored" data-anchor-id="conclusion"><span class="header-section-number">9</span> Conclusion</h2>
<p>There is <strong>no one-size-fits-all</strong> solution for missing data. The right approach depends on your <strong>goal (prediction vs.&nbsp;inference)</strong>, the <strong>missingness mechanism (MCAR/MAR/MNAR)</strong>, your <strong>data structure</strong> (cross-sectional vs.&nbsp;longitudinal), and <strong>practical constraints</strong> (time, compute, expertise).</p>
<section id="what-our-nhanes-walkthrough-showed" class="level3" data-number="9.1">
<h3 data-number="9.1" class="anchored" data-anchor-id="what-our-nhanes-walkthrough-showed"><span class="header-section-number">9.1</span> What our NHANES walkthrough showed</h3>
<ul>
<li><strong>Complete-case analysis</strong> is simple but wastes data and can bias results unless MCAR is plausible.</li>
<li><strong>Single imputation</strong> (mean/median, kNN, missForest run once) keeps all rows but <strong>underestimates uncertainty</strong>, yielding overconfident inferences.</li>
<li><strong>Multiple imputation (MICE)</strong> typically strikes the best balance for <strong>inference under MAR</strong>: it preserves multivariate structure and <strong>propagates uncertainty</strong> (via pooling), producing more honest standard errors and CIs.</li>
<li><strong>Nonparametric imputers</strong> like <strong>missForest</strong> are strong for <strong>predictive accuracy</strong> on complex, nonlinear structure, but they do <strong>not</strong> capture imputation uncertainty by themselves.</li>
</ul>
</section>
<section id="practical-guidance-decision-oriented" class="level3" data-number="9.2">
<h3 data-number="9.2" class="anchored" data-anchor-id="practical-guidance-decision-oriented"><span class="header-section-number">9.2</span> Practical guidance (decision-oriented)</h3>
<ul>
<li><strong>If your main task is prediction</strong> and interpretability is secondary → a good single-imputation engine (e.g., <strong>missForest</strong>) can be effective, with careful validation.</li>
<li><strong>If your main task is inference</strong> (effect sizes, CIs, p-values) and <strong>MAR is reasonable</strong> → prefer <strong>MICE</strong>; include strong predictors of both the outcome and missingness; check diagnostics.</li>
<li><strong>If you suspect MNAR</strong> → acknowledge this explicitly and consider <strong>sensitivity analyses</strong> (pattern-mixture/selection models) rather than assuming MAR.</li>
</ul>
</section>
<section id="reporting-checklist-make-your-analysis-reproducible-credible" class="level3" data-number="9.3">
<h3 data-number="9.3" class="anchored" data-anchor-id="reporting-checklist-make-your-analysis-reproducible-credible"><span class="header-section-number">9.3</span> Reporting checklist (make your analysis reproducible &amp; credible)</h3>
<ul>
<li>% missing <strong>by variable</strong> and <strong>by key subgroups</strong> (e.g., Age, Gender).</li>
<li>Your <strong>assumed mechanism</strong> (MCAR/MAR/MNAR) and why it’s plausible.</li>
<li>The <strong>method(s)</strong> used (e.g., MICE with <code>pmm</code>, <code>m</code>, <code>maxit</code>, <code>predictorMatrix</code>; or missForest with <code>ntree</code>, <code>maxiter</code>).</li>
<li><strong>Diagnostics</strong> (trace/density/strip plots for MICE; OOB error for missForest).</li>
<li>For MI: <strong>pooled estimates</strong> with standard errors/intervals; clarify how pooling was performed.</li>
<li><strong>Limitations</strong> (e.g., potential MNAR, model misspecification, small-sample caveats).</li>
</ul>
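<p>The first checklist item takes only two lines of base R. The data frame <code>toy</code> below is a made-up example with hypothetical column names; substitute your own data and grouping variable.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Made-up data for illustration
toy &lt;- data.frame(
  gender = c("female", "male", "female", "male", "female", "male"),
  bmi    = c(22.1, NA, 27.4, NA, 30.2, 25.0),
  bp     = c(118, 121, NA, 135, 110, 128)
)

# Percent missing by variable
pct_missing &lt;- colMeans(is.na(toy)) * 100
round(pct_missing, 1)

# Percent of missing BMI values within each gender
pct_by_gender &lt;- tapply(is.na(toy$bmi), toy$gender, mean) * 100
round(pct_by_gender, 1)</code></pre></div>
</div>
<p>A large gap between subgroups, as in this toy example where only male respondents have missing BMI, is a hint that missingness depends on observed covariates and that MCAR is unlikely to hold.</p>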
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>Rule of thumb.</strong> Use simple methods for quick EDA; use <strong>MICE</strong> for publication-grade inference under MAR; use <strong>missForest</strong> when you primarily need strong <strong>predictive performance</strong> on mixed/complex data.</p>
</div>
</div>
</section>
<section id="common-pitfalls-to-avoid" class="level3" data-number="9.4">
<h3 data-number="9.4" class="anchored" data-anchor-id="common-pitfalls-to-avoid"><span class="header-section-number">9.4</span> Common pitfalls to avoid</h3>
<ul>
<li>Treating imputed values as if they were observed “truth” (single imputation + significance testing).</li>
<li>Imputing the <strong>outcome</strong> itself (generally avoid; let it <em>inform</em> predictor imputations instead).</li>
<li>Ignoring <strong>leakage</strong>: fit imputers <strong>within</strong> resampling folds/splits, not on the full data.</li>
<li>Omitting key covariates that explain missingness (weakens the MAR assumption and the imputer).</li>
</ul>
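<p>The leakage point is the one most often missed in predictive work, so here is a minimal base R sketch of the idea. It assumes a simple 70/30 split and median imputation on a simulated predictor; all object names are hypothetical, and a real pipeline would apply the same principle inside each resampling fold.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(1)
x &lt;- rnorm(100, mean = 50, sd = 10)     # hypothetical predictor
x[sample(100, 15)] &lt;- NA                # 15 missing values

train_idx &lt;- sample(100, 70)            # 70/30 train/test split
train &lt;- x[train_idx]
test  &lt;- x[-train_idx]

# Learn the imputation value from the training rows only
fill &lt;- median(train, na.rm = TRUE)

train_imp &lt;- ifelse(is.na(train), fill, train)
test_imp  &lt;- ifelse(is.na(test),  fill, test)   # test reuses the training median</code></pre></div>
</div>
<p>Computing <code>fill</code> from the full vector <code>x</code> would let information from the test rows shape the training data, which tends to make performance estimates look better than they are.</p>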
</section>
<section id="where-to-go-next" class="level3" data-number="9.5">
<h3 data-number="9.5" class="anchored" data-anchor-id="where-to-go-next"><span class="header-section-number">9.5</span> Where to go next</h3>
<ul>
<li><strong>Leakage-free pipelines</strong> with <code>tidymodels::recipes</code> (train/test split done right).</li>
<li><strong>Sensitivity analyses</strong> for MNAR.</li>
<li><strong>Robustness checks</strong> (alternative imputation models, different <code>m</code>, predictor sets).</li>
</ul>
<p><strong>Bottom line:</strong> Choose methods intentionally, <strong>justify assumptions</strong>, show diagnostics, and <strong>report pooled results</strong> when using MI. Good missing-data practice is less about one magic function and more about transparent, principled workflow.</p>
</section>
</section>
<section id="references" class="level2" data-number="10">
<h2 data-number="10" class="anchored" data-anchor-id="references"><span class="header-section-number">10</span> References</h2>
<ul>
<li><p>Allison, P. D. (2001). <em>Missing Data</em>. Sage Publications.</p></li>
<li><p>Enders, C. K. (2010). <em>Applied Missing Data Analysis</em>. The Guilford Press.</p></li>
<li><p>Little, R. J. A., &amp; Rubin, D. B. (2002). <em>Statistical Analysis with Missing Data</em> (2nd ed.). Wiley.</p></li>
<li><p>van Buuren, S. (2018). <em>Flexible Imputation of Missing Data</em> (2nd ed.). Chapman &amp; Hall/CRC.</p></li>
<li><p>Stekhoven, D. J., &amp; Bühlmann, P. (2012). “MissForest—Nonparametric Missing Value Imputation for Mixed-Type Data.” <em>Bioinformatics</em>, 28(1), 112–118. https://doi.org/10.1093/bioinformatics/btr597</p></li>
<li><p>Rubin, D. B. (1987). <em>Multiple Imputation for Nonresponse in Surveys</em>. Wiley.</p></li>
<li><p>Schafer, J. L. (1997). <em>Analysis of Incomplete Multivariate Data</em>. Chapman &amp; Hall/CRC.</p></li>
<li><p>R Documentation: <a href="https://cran.r-project.org/package=mice"><code>mice</code> package</a></p></li>
<li><p>R Documentation: <a href="https://cran.r-project.org/package=missForest"><code>missForest</code> package</a></p></li>
</ul>



</section>

 ]]></description>
  <category>R</category>
  <category>Statistics</category>
  <category>Data Analysis</category>
  <category>Data Science</category>
  <category>Data Preprocessing</category>
  <category>Missing Data</category>
  <category>Data Cleaning</category>
  <category>Imputation</category>
  <guid>https://mfatihtuzen.github.io/posts/2025-08-18_missing_values/</guid>
  <pubDate>Mon, 18 Aug 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Standard Deviation vs. Standard Error: Meaning, Misuse, and the Math Behind the Confusion</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2025-06-11_sd_vs_se/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-06-11_sd_vs_se/sd_vs_se.png" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>The left side illustrates standard deviation as the spread of individual data values around the population mean (μ). The right side shows standard error as the variability in sample means (x̄) obtained from repeated sampling. Notice how the SE distribution is narrower—it represents uncertainty in the estimate, not variability in the raw data.</figcaption>
</figure>
</div>
<section id="introduction-why-this-confusion-still-matters" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction-why-this-confusion-still-matters"><span class="header-section-number">1</span> Introduction: Why This Confusion Still Matters</h2>
<p>In the world of data analysis and statistics, <strong>standard deviation (SD)</strong> and <strong>standard error (SE)</strong> are two concepts that are often misunderstood or—worse—used interchangeably. This confusion isn’t just academic: misinterpreting these two measures can lead to poor conclusions, misleading visualizations, and incorrect inferences, especially in reports intended for non-technical audiences.</p>
<p>Think about this: you read a news article stating that <em>“the average income of a sample group is $3,000 with a standard error of $500.”</em> But then another article says <em>“the same average income with a standard deviation of $500.”</em> Should your level of confidence change? Absolutely—because they tell two fundamentally different stories.</p>
<p>This article aims to:</p>
<ul>
<li>Define and differentiate standard deviation and standard error,</li>
<li>Explore their mathematical foundations,</li>
<li>Demonstrate their practical implications with real R code and visuals,</li>
<li>Warn about common pitfalls and interpretation mistakes.</li>
</ul>
<p>By the end of this post, you’ll not only <strong>understand the difference</strong> but also <strong>know exactly when and why each metric matters</strong>.</p>
</section>
<section id="definitions-and-mathematical-foundation" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="definitions-and-mathematical-foundation"><span class="header-section-number">2</span> Definitions and Mathematical Foundation</h2>
<p>Understanding the difference between <strong>standard deviation</strong> and <strong>standard error</strong> requires going beyond surface-level definitions. While they are mathematically related, they answer fundamentally different questions.</p>
<section id="standard-deviation-sd" class="level3" data-number="2.1">
<h3 data-number="2.1" class="anchored" data-anchor-id="standard-deviation-sd"><span class="header-section-number">2.1</span> Standard Deviation (SD)</h3>
<p>Standard deviation is a measure of <strong>variability</strong> or <strong>dispersion</strong> within a single dataset. It tells us how far individual observations tend to deviate from the sample (or population) mean.</p>
<p>Mathematically, for a sample of size <img src="https://latex.codecogs.com/png.latex?n">, the sample standard deviation is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0As%20=%20%5Csqrt%7B%20%5Cfrac%7B1%7D%7Bn%20-%201%7D%20%5Csum_%7Bi=1%7D%5E%7Bn%7D%20(x_i%20-%20%5Cbar%7Bx%7D)%5E2%20%7D%0A"></p>
<p>Where:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?x_i">: Each data point</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cbar%7Bx%7D">: Sample mean</li>
<li><img src="https://latex.codecogs.com/png.latex?n">: Number of observations</li>
</ul>
<p>Standard deviation is widely used in descriptive statistics to understand how <strong>spread out</strong> the values in a dataset are. A large SD implies high variability, while a small SD suggests the values are clustered closely around the mean.</p>
<blockquote class="blockquote">
<p>📌 <strong>Use case</strong>: “How much do individual students’ test scores vary from the class average?”</p>
</blockquote>
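<p>As a quick check of the formula, the sample standard deviation can be computed by hand and compared with R’s built-in <code>sd()</code>. The eight values below are toy data chosen so that the mean is exactly 5.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">x &lt;- c(2, 4, 4, 4, 5, 5, 7, 9)   # toy data, mean is 5
n &lt;- length(x)

# Sum of squared deviations from the mean, divided by n - 1
s_manual &lt;- sqrt(sum((x - mean(x))^2) / (n - 1))

s_manual                     # about 2.138
all.equal(s_manual, sd(x))   # TRUE</code></pre></div>
</div>
<p>The squared deviations sum to 32, so the result is the square root of 32/7. Dividing by <code>n - 1</code> rather than <code>n</code> is what distinguishes the sample formula from the population formula.</p>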
</section>
<section id="standard-error-se" class="level3" data-number="2.2">
<h3 data-number="2.2" class="anchored" data-anchor-id="standard-error-se"><span class="header-section-number">2.2</span> Standard Error (SE)</h3>
<p>Standard error, in contrast, is a measure of <strong>precision</strong>—specifically, the precision of an estimate like the sample mean. It tells us how much the <strong>sample mean</strong> would vary if we repeatedly drew samples from the population.</p>
<p>It is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BSE%7D%20=%20%5Cfrac%7Bs%7D%7B%5Csqrt%7Bn%7D%7D%0A"></p>
<p>As you can see, SE is directly related to the standard deviation but scaled down by the square root of the sample size. This reflects the idea that <strong>more data gives more precise estimates</strong>.</p>
<blockquote class="blockquote">
<p>📌 <strong>Use case</strong>: “How much uncertainty is there in the sample mean as an estimate of the population mean?”</p>
</blockquote>
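<p>A short calculation shows how the square root in the denominator plays out. The sketch holds the standard deviation fixed at an assumed value of 15 and varies only the sample size; the three sizes are arbitrary choices for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">s &lt;- 15                   # assumed sample standard deviation
n &lt;- c(25, 100, 400)      # three hypothetical sample sizes

se &lt;- s / sqrt(n)         # standard error of the mean for each n

data.frame(n = n, sd = s, se = se)
# se column: 3.00, 1.50, 0.75</code></pre></div>
</div>
<p>Quadrupling the sample size halves the standard error, while the standard deviation stays where it is. Precision improves with more data, but only at a square-root rate.</p>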
<hr>
<p>In short:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 19%">
<col style="width: 32%">
<col style="width: 24%">
<col style="width: 24%">
</colgroup>
<thead>
<tr class="header">
<th>Concept</th>
<th>Measures</th>
<th>Based on</th>
<th>Affected by Sample Size</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Standard Deviation</td>
<td>Spread of individual data points</td>
<td>Individual observations</td>
<td>❌ No</td>
</tr>
<tr class="even">
<td>Standard Error</td>
<td>Uncertainty in the sample mean</td>
<td>Sampling distribution</td>
<td>✅ Yes</td>
</tr>
</tbody>
</table>
<p>Understanding this distinction is critical for drawing correct conclusions—especially in inferential statistics, confidence intervals, and hypothesis testing.</p>
</section>
</section>
<section id="visualizing-the-difference-with-r-simulation-and-interpretation" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="visualizing-the-difference-with-r-simulation-and-interpretation"><span class="header-section-number">3</span> Visualizing the Difference with R: Simulation and Interpretation</h2>
<p>Let’s use R to <strong>visualize</strong> and <strong>truly understand</strong> the difference between standard deviation and standard error.</p>
<p>We’ll start by generating a single random sample from a known population and examining the spread of individual values. Then, we’ll simulate multiple samples to show how the sample means vary—and how that variation reflects the standard error.</p>
<section id="standard-deviation-spread-of-values-within-a-sample" class="level3" data-number="3.1">
<h3 data-number="3.1" class="anchored" data-anchor-id="standard-deviation-spread-of-values-within-a-sample"><span class="header-section-number">3.1</span> Standard Deviation: Spread of Values Within a Sample</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb1-2">sample_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)</span></code></pre></div></div>
</div>
<p>We generate 50 values from a normal distribution with a mean of 100 and a standard deviation of 15. This mimics a situation like measuring the heights, weights, or incomes of 50 individuals.</p>
<p>Let’s visualize how these values are distributed.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb2-2"></span>
<span id="cb2-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> sample_data), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> x)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> after_stat(density)), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">binwidth =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"steelblue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_density</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"solid"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_vline</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xintercept =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x)), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb2-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb2-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Standard Deviation: Spread of Individual Values"</span>,</span>
<span id="cb2-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Value"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Density"</span></span>
<span id="cb2-10">  )</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-06-11_sd_vs_se/index_files/figure-html/unnamed-chunk-2-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>What This Graph Shows</strong></p>
<ul>
<li><p>The histogram shows the <strong>distribution of raw data</strong> from our single sample.</p></li>
<li><p>The black curve is a <strong>kernel density estimate</strong>, giving us a smooth representation of the distribution.</p></li>
<li><p>The red dashed line marks the <strong>sample mean</strong>.</p></li>
<li><p>The <strong>spread around this mean</strong>, that is, the width of the histogram, is what the <strong>standard deviation</strong> quantifies.</p></li>
</ul>
<p>So, in simple terms: <strong>standard deviation tells us how much individual values differ from their mean in one sample</strong>. It answers the question:</p>
<blockquote class="blockquote">
<p>“Are most values close to the average, or are they all over the place?”</p>
</blockquote>
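<p>To put a number on that spread, here is a minimal sketch that computes the sample standard deviation twice: once from its definition and once with R’s built-in <code>sd()</code>. It assumes the same seed and parameters as the sample above; the names <code>manual_sd</code> and <code>builtin_sd</code> are illustrative only.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Recreate the sample so this block runs on its own
set.seed(42)
sample_data &lt;- rnorm(50, mean = 100, sd = 15)

# Sample SD from its definition:
# square root of the summed squared deviations divided by (n - 1)
n &lt;- length(sample_data)
manual_sd &lt;- sqrt(sum((sample_data - mean(sample_data))^2) / (n - 1))

# Built-in equivalent
builtin_sd &lt;- sd(sample_data)

c(manual = manual_sd, builtin = builtin_sd)</code></pre>
</div>
<p>The two values should agree. With only 50 observations, the result will be near the population value of 15 but is unlikely to match it exactly.</p>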
</section>
<section id="standard-error-spread-of-sample-means-across-repeated-samples" class="level3" data-number="3.2">
<h3 data-number="3.2" class="anchored" data-anchor-id="standard-error-spread-of-sample-means-across-repeated-samples"><span class="header-section-number">3.2</span> Standard Error: Spread of Sample Means Across Repeated Samples</h3>
<p>Now let’s go one level deeper. Instead of looking at one sample, let’s imagine we repeatedly draw many samples from the same population, each of size 50, and record their means.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">sample_means <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">replicate</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)))</span></code></pre></div></div>
</div>
<p>Let’s see how those means are distributed:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> sample_means), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> mean)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> after_stat(density)), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">binwidth =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"darkorange"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_density</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"solid"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_vline</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xintercept =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(mean)), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Standard Error: Variability of Sample Means"</span>,</span>
<span id="cb4-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sample Mean"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Density"</span></span>
<span id="cb4-8">  )</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-06-11_sd_vs_se/index_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>What This Graph Shows</strong></p>
<ul>
<li><p>Each bar in the histogram shows the <strong>density (relative frequency) of sample means</strong> that fall within a one-unit range.</p></li>
<li><p>The curve again shows the <strong>estimated density</strong> of the sample means.</p></li>
<li><p>The red dashed line is the <strong>grand mean</strong> of all 1,000 sample means—it should be close to 100.</p></li>
<li><p>Unlike the previous graph, here we don’t see individual values but <strong>mean values from many samples</strong>.</p></li>
</ul>
<p>This distribution is known as the <strong>sampling distribution of the sample mean</strong>.</p>
<p>And the <strong>standard deviation of this distribution is the standard error</strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">se_estimate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(sample_means)</span>
<span id="cb5-2">se_estimate</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2.113943</code></pre>
</div>
</div>
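<p>As a quick check, the simulated value can be compared with the theoretical standard error σ / √n, which is known here only because we chose the population ourselves. The sketch below reruns the simulation as a self-contained block, so its empirical value may differ slightly from the output above; <code>empirical_se</code> and <code>theoretical_se</code> are illustrative names.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(42)
n &lt;- 50       # size of each sample
sigma &lt;- 15   # population standard deviation (known only because we simulate)

# Means of 1,000 independent samples of size n
sample_means &lt;- replicate(1000, mean(rnorm(n, mean = 100, sd = sigma)))

empirical_se &lt;- sd(sample_means)     # spread of the simulated means
theoretical_se &lt;- sigma / sqrt(n)    # 15 / sqrt(50), about 2.12

c(empirical = empirical_se, theoretical = theoretical_se)</code></pre>
</div>
<p>The two numbers should be close but not identical, because the empirical value is itself an estimate based on a finite number of simulated samples.</p>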
</section>
<section id="interpretation-two-types-of-spread-two-different-questions" class="level3" data-number="3.3">
<h3 data-number="3.3" class="anchored" data-anchor-id="interpretation-two-types-of-spread-two-different-questions"><span class="header-section-number">3.3</span> Interpretation: Two Types of Spread, Two Different Questions</h3>
<p>Let’s pause and reflect on what we’ve seen so far.</p>
<p>Although <strong>standard deviation</strong> and <strong>standard error</strong> are both measures of “spread,” they describe very different things, answer different questions, and are used in different contexts.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 16%">
<col style="width: 39%">
<col style="width: 18%">
<col style="width: 26%">
</colgroup>
<thead>
<tr class="header">
<th>Concept</th>
<th>What it Measures</th>
<th>Based on…</th>
<th>Changes with Sample Size (<img src="https://latex.codecogs.com/png.latex?n">)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Standard Deviation</td>
<td>Spread of individual data values</td>
<td>Single sample</td>
<td>❌ No</td>
</tr>
<tr class="even">
<td>Standard Error</td>
<td>Spread of sample means across repeated samples</td>
<td>Sampling distribution</td>
<td>✅ Yes</td>
</tr>
</tbody>
</table>
<hr>
<section id="summary-of-interpretation" class="level4" data-number="3.3.1">
<h4 data-number="3.3.1" class="anchored" data-anchor-id="summary-of-interpretation"><span class="header-section-number">3.3.1</span> Summary of Interpretation</h4>
<ul>
<li><p><strong>Standard deviation (SD)</strong> tells us:<br>
&gt; <em>“How much do individual values differ from the average within a sample?”</em></p></li>
<li><p><strong>Standard error (SE)</strong> tells us:<br>
&gt; <em>“How much would the sample average vary if we repeated the sampling?”</em></p></li>
</ul>
<p>In other words:</p>
<ul>
<li><strong>SD measures natural variability</strong> among individuals (or observations).</li>
<li><strong>SE measures the statistical uncertainty</strong> of an estimate, usually the sample mean.</li>
</ul>
<p>This difference is not just semantic—it has <strong>critical consequences</strong> for data interpretation:</p>
<ul>
<li>You use <strong>SD</strong> when describing the spread of your sample or population.</li>
<li>You use <strong>SE</strong> when making inferences, estimating confidence intervals, or assessing how trustworthy your sample statistic is.</li>
</ul>
</section>
<section id="the-mathematical-connection" class="level4" data-number="3.3.2">
<h4 data-number="3.3.2" class="anchored" data-anchor-id="the-mathematical-connection"><span class="header-section-number">3.3.2</span> The Mathematical Connection</h4>
<p>As we saw earlier, the standard error is mathematically derived from the standard deviation:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BSE%7D%20=%20%5Cfrac%7Bs%7D%7B%5Csqrt%7Bn%7D%7D%0A"></p>
<p>This formula reveals a fundamental principle in statistics:</p>
<ul>
<li>The more data you collect (larger <img src="https://latex.codecogs.com/png.latex?n">), the <strong>more stable</strong> your sample mean becomes.</li>
<li>However, the <strong>variability within the sample</strong> (standard deviation <img src="https://latex.codecogs.com/png.latex?s">) stays roughly the same, because it estimates the spread of the population, which does not depend on how many observations you took.</li>
</ul>
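<p>The two points above can be sketched with a small simulation. It assumes the same normal population as before (mean 100, SD 15); the sample sizes and the name <code>results</code> are arbitrary choices for illustration.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(42)
sizes &lt;- c(10, 100, 1000, 10000)

# For each sample size, draw one sample and record its SD and SE
results &lt;- do.call(rbind, lapply(sizes, function(n) {
  x &lt;- rnorm(n, mean = 100, sd = 15)
  data.frame(n = n, sd = sd(x), se = sd(x) / sqrt(n))
}))

results</code></pre>
</div>
<p>The <code>sd</code> column should hover around 15 at every sample size, while the <code>se</code> column shrinks by roughly a factor of √10 (about 3.16) each time the sample size grows tenfold.</p>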
<blockquote class="blockquote">
<p>🧠 <strong>Key insight</strong>:<br>
<strong>Standard deviation reflects the reality of your data.</strong><br>
<strong>Standard error reflects your uncertainty about the mean.</strong></p>
</blockquote>
</section>
</section>
</section>
<section id="common-mistakes-and-misinterpretations" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="common-mistakes-and-misinterpretations"><span class="header-section-number">4</span> Common Mistakes and Misinterpretations</h2>
<p>Despite their differences, standard deviation and standard error are frequently confused—even in academic papers, business reports, and media articles. Below are some of the most common mistakes and why they matter.</p>
<section id="mistake-1-using-standard-error-instead-of-standard-deviation-in-descriptive-summaries" class="level3" data-number="4.1">
<h3 data-number="4.1" class="anchored" data-anchor-id="mistake-1-using-standard-error-instead-of-standard-deviation-in-descriptive-summaries"><span class="header-section-number">4.1</span> Mistake 1: Using Standard Error Instead of Standard Deviation in Descriptive Summaries</h3>
<p>A classic mistake is reporting the <strong>standard error</strong> when trying to describe how spread out individual values are.</p>
<blockquote class="blockquote">
<p>❌ <em>“The average score was 80 ± 2 (SE)”</em> (when the goal is to describe spread)<br>
✅ <em>“The average score was 80 ± 10 (SD, n = 25)”</em></p>
</blockquote>
<p>In descriptive statistics—such as reporting the results of a survey, an experiment, or a class performance—you almost always want to use the <strong>standard deviation</strong>, because it reflects <strong>individual variability</strong>.</p>
<blockquote class="blockquote">
<p>📌 The standard error, by contrast, only makes sense if your goal is to communicate <strong>how uncertain your estimate of the mean is</strong>, not how diverse the sample is.</p>
</blockquote>
<hr>
</section>
<section id="mistake-2-adding-error-bars-to-a-barplot-without-clarifying-whether-its-sd-or-se" class="level3" data-number="4.2">
<h3 data-number="4.2" class="anchored" data-anchor-id="mistake-2-adding-error-bars-to-a-barplot-without-clarifying-whether-its-sd-or-se"><span class="header-section-number">4.2</span> Mistake 2: Adding Error Bars to a Barplot Without Clarifying Whether It’s SD or SE</h3>
<p>Barplots with error bars are everywhere—but often, those bars are unlabeled, or worse, mislabeled.</p>
<ul>
<li>If the error bars show the <strong>standard deviation</strong>, they convey the <strong>typical spread</strong> of individual values around the mean (not their full range).</li>
<li>If they are <strong>standard error</strong>, they show the <strong>precision of the mean estimate</strong>.</li>
</ul>
<p>Yet many charts leave this ambiguous or assume the reader will infer it.</p>
<blockquote class="blockquote">
<p>✏️ <strong>Always label your error bars.</strong> In R and ggplot2, you can add <code>labs(caption = "Error bars represent ±1 SE")</code> to avoid confusion.</p>
</blockquote>
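<p>A minimal sketch of a labeled barplot follows. The three groups and their scores are hypothetical, simulated purely for illustration, and the names <code>scores</code> and <code>summary_df</code> are assumptions rather than anything used elsewhere in this post.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">library(ggplot2)

# Hypothetical data: 30 simulated scores for each of three groups
set.seed(42)
scores &lt;- data.frame(
  group = rep(c("A", "B", "C"), each = 30),
  score = c(rnorm(30, 70, 10), rnorm(30, 75, 10), rnorm(30, 80, 10))
)

# Per-group mean and standard error (SD divided by sqrt of group size)
group_mean &lt;- tapply(scores$score, scores$group, mean)
group_se &lt;- tapply(scores$score, scores$group, function(x) sd(x) / sqrt(length(x)))

summary_df &lt;- data.frame(
  group = names(group_mean),
  mean_score = as.numeric(group_mean),
  se = as.numeric(group_se)
)

# Bars show group means; error bars span one SE on each side
p &lt;- ggplot(summary_df, aes(x = group, y = mean_score)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_errorbar(aes(ymin = mean_score - se, ymax = mean_score + se), width = 0.2) +
  labs(
    title = "Mean Score by Group",
    x = "Group", y = "Mean score",
    caption = "Error bars represent ±1 SE"
  )

p</code></pre>
</div>
<p>To show standard deviations instead, replace <code>se</code> with the per-group SD in <code>geom_errorbar()</code> and update the caption to match.</p>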
<hr>
</section>
<section id="mistake-3-believing-that-se-can-describe-the-samples-spread" class="level3" data-number="4.3">
<h3 data-number="4.3" class="anchored" data-anchor-id="mistake-3-believing-that-se-can-describe-the-samples-spread"><span class="header-section-number">4.3</span> Mistake 3: Believing That SE Can Describe the Sample’s Spread</h3>
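<p>Another subtle misinterpretation is thinking that <strong>a small SE implies the data itself is tightly clustered</strong>. But SE on its own says little about the spread among individual values: because it equals the standard deviation divided by the square root of the sample size, it shrinks as the sample grows even when the data stay just as dispersed.</p>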
<blockquote class="blockquote">
<p>A sample can have high variability (large SD), but still have a small SE if the sample size is large.</p>
</blockquote>
<p>This is especially misleading in <strong>clinical trials</strong> or <strong>public health studies</strong>, where the sample size might be very large—but individual responses vary wildly.</p>
<blockquote class="blockquote">
<p>📉 Low SE ≠ Low diversity. It just means the average is estimated precisely.</p>
</blockquote>
<hr>
</section>
<section id="mistake-4-reporting-se-without-context" class="level3" data-number="4.4">
<h3 data-number="4.4" class="anchored" data-anchor-id="mistake-4-reporting-se-without-context"><span class="header-section-number">4.4</span> Mistake 4: Reporting SE Without Context</h3>
<p>It’s not uncommon to see a mean value with a standard error reported like this:</p>
<blockquote class="blockquote">
<p><em>“Mean blood pressure: 132 ± 1.5”</em></p>
</blockquote>
<p>This may seem informative—but <strong>without knowing the sample size</strong>, this value has limited meaning.</p>
<p>Why? Because SE is <strong>dependent on</strong> <img src="https://latex.codecogs.com/png.latex?n">. A standard error of 1.5 from 10 observations is very different from the same SE based on 10,000 observations.</p>
<blockquote class="blockquote">
<p>✔️ Always include the <strong>sample size</strong> and preferably also the <strong>standard deviation</strong>, especially if the goal is transparency and reproducibility.</p>
</blockquote>
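<p>To see why the sample size matters, the relationship SE = s / √n can be rearranged to recover the standard deviation implied by a reported SE. The sketch below uses the SE of 1.5 and the two sample sizes mentioned above; <code>implied_sd</code> is an illustrative name.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">se &lt;- 1.5
n &lt;- c(10, 10000)

# SE = s / sqrt(n), so s = SE * sqrt(n)
implied_sd &lt;- se * sqrt(n)

data.frame(n = n, se = se, implied_sd = round(implied_sd, 2))</code></pre>
</div>
<p>The same SE of 1.5 corresponds to a standard deviation of about 4.7 when n is 10, but 150 when n is 10,000. These are two very different pictures of how much individuals vary.</p>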
<hr>
</section>
<section id="final-rule-of-thumb" class="level3" data-number="4.5">
<h3 data-number="4.5" class="anchored" data-anchor-id="final-rule-of-thumb"><span class="header-section-number">4.5</span> Final Rule of Thumb</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>If you want to…</th>
<th>Use…</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Describe how individuals vary</td>
<td>Standard Deviation</td>
</tr>
<tr class="even">
<td>Quantify uncertainty about the sample mean</td>
<td>Standard Error</td>
</tr>
<tr class="odd">
<td>Construct a confidence interval</td>
<td>Standard Error</td>
</tr>
<tr class="even">
<td>Show variability in raw data</td>
<td>Standard Deviation</td>
</tr>
</tbody>
</table>
<p>By respecting the purpose and proper use of these two measures, you’ll avoid misleading your audience—and build more trust in your analyses.</p>
</section>
</section>
<section id="a-real-world-example-monthly-spending-survey-in-usd" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="a-real-world-example-monthly-spending-survey-in-usd"><span class="header-section-number">5</span> A Real-World Example: Monthly Spending Survey in USD</h2>
<p>Let’s now apply what we’ve learned in a more realistic, international scenario.</p>
<p>Imagine a survey conducted in a mid-sized city where 40 individuals are asked:</p>
<blockquote class="blockquote">
<p><em>“How much money do you spend per month (in US Dollars)?”</em></p>
</blockquote>
<p>We simulate responses centered around $2,000, with a standard deviation of $500.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb7-2">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span></span>
<span id="cb7-3">monthly_spending <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(n, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb7-4"></span>
<span id="cb7-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(monthly_spending)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 1720 1885 2779 2035 2065 2858</code></pre>
</div>
</div>
<section id="descriptive-statistics" class="level3" data-number="5.1">
<h3 data-number="5.1" class="anchored" data-anchor-id="descriptive-statistics"><span class="header-section-number">5.1</span> Descriptive Statistics</h3>
<p>Now let’s compute the mean, standard deviation, and standard error:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">mean_spending <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(monthly_spending)</span>
<span id="cb9-2">sd_spending <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(monthly_spending)</span>
<span id="cb9-3">se_spending <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> sd_spending <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sqrt</span>(n)</span>
<span id="cb9-4"></span>
<span id="cb9-5">mean_spending</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2022.6</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1">sd_spending</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 448.8549</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1">se_spending</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 70.9702</code></pre>
</div>
</div>
<p>Let’s interpret the output:</p>
<ul>
<li><p><strong>Mean monthly spending</strong>: approximately 2023 USD</p></li>
<li><p><strong>Standard deviation</strong>: approximately 449 USD</p></li>
<li><p><strong>Standard error</strong>: approximately 71 USD</p></li>
</ul>
</section>
<section id="what-do-these-numbers-tell-us" class="level3" data-number="5.2">
<h3 data-number="5.2" class="anchored" data-anchor-id="what-do-these-numbers-tell-us"><span class="header-section-number">5.2</span> What Do These Numbers Tell Us?</h3>
<ul>
<li><p>The <strong>standard deviation</strong> tells us that individual spending typically deviates from the average by about 449 USD. Values one standard deviation below and above the mean are roughly 1,574 USD and 2,471 USD, and for approximately normal data about two-thirds of respondents fall within that range.</p></li>
<li><p>The <strong>standard error</strong> tells us that the <strong>average</strong> itself would typically vary by about 71 USD from one sample of 40 people to the next, purely because of sampling variability.</p></li>
</ul>
<div>
<p>📌 While individuals differ substantially in spending habits, the sample mean is relatively stable thanks to a reasonably large sample size (<img src="https://latex.codecogs.com/png.latex?n%20="> 40).</p>
</div>
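<p>One common use of the standard error is to build a confidence interval for the mean. The sketch below recreates the simulated survey so that it runs on its own, then computes a 95% interval from the t distribution. The names <code>t_crit</code> and <code>ci</code> are illustrative, and the approach assumes the responses are roughly normal, which holds here by construction.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Recreate the simulated survey responses
set.seed(123)
n &lt;- 40
monthly_spending &lt;- round(rnorm(n, mean = 2000, sd = 500), 0)

mean_spending &lt;- mean(monthly_spending)
se_spending &lt;- sd(monthly_spending) / sqrt(n)

# t critical value with n - 1 degrees of freedom for a 95% interval
t_crit &lt;- qt(0.975, df = n - 1)

ci &lt;- c(
  lower = mean_spending - t_crit * se_spending,
  upper = mean_spending + t_crit * se_spending
)
ci</code></pre>
</div>
<p>The interval is roughly the sample mean plus or minus two standard errors, and it should match the interval reported by <code>t.test(monthly_spending)</code>.</p>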
</section>
<section id="visualizing-the-distribution" class="level3" data-number="5.3">
<h3 data-number="5.3" class="anchored" data-anchor-id="visualizing-the-distribution"><span class="header-section-number">5.3</span> Visualizing the Distribution</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb15-2"></span>
<span id="cb15-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">spending =</span> monthly_spending), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> spending)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
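font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">after_stat</span>(density)), <span class="at" style="color: #657422;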
background-color: null;
font-style: inherit;">binwidth =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">250</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"skyblue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_density</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"darkblue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_vline</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xintercept =</span> mean_spending), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb15-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Distribution of Monthly Spending"</span>,</span>
<span id="cb15-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Monthly Spending (USD)"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Density"</span></span>
<span id="cb15-10">  )</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-06-11_sd_vs_se/index_files/figure-html/unnamed-chunk-8-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>This graph shows:</p>
<ul>
<li><p>The <strong>red dashed line</strong> marks the sample mean (about 2023 USD).</p></li>
<li><p>The <strong>width</strong> of the histogram and smooth curve represents the variability in spending.</p></li>
<li><p>This is captured by the <strong>standard deviation</strong>, not the standard error.</p></li>
</ul>
</section>
<section id="confidence-interval-for-the-mean" class="level3" data-number="5.4">
<h3 data-number="5.4" class="anchored" data-anchor-id="confidence-interval-for-the-mean"><span class="header-section-number">5.4</span> Confidence Interval for the Mean</h3>
<p>Let’s calculate a 95% confidence interval using the standard error:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1">lower <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mean_spending <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.96</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> se_spending</span>
<span id="cb16-2">upper <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mean_spending <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.96</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> se_spending</span>
<span id="cb16-3"></span>
<span id="cb16-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(lower, upper)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 1883.498 2161.702</code></pre>
</div>
</div>
<p>Result:</p>
<blockquote class="blockquote">
<p><strong>Confidence interval:</strong> approximately 1883 to 2162 USD</p>
</blockquote>
<p>This tells us:</p>
<blockquote class="blockquote">
<p>“We are 95% confident that the true average monthly spending of the population lies between 1883 and 2162 USD.”</p>
</blockquote>
<p><strong>Remember:</strong> this range reflects <strong>uncertainty about the mean</strong>, not individual variability.</p>
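<p>With a sample of 40, the multiplier 1.96 is a normal approximation. A slightly more conservative alternative is to use a t quantile with n - 1 degrees of freedom. The sketch below uses the rounded mean and standard error from above (2023 and 71 USD); these are assumed values for illustration, not the exact objects from the post.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">mean_rounded &lt;- 2023   # sample mean in USD (rounded, assumed)
se_rounded   &lt;- 71     # standard error in USD (rounded, assumed)
n            &lt;- 40     # sample size

# Normal-based interval, as in the code above
ci_z &lt;- mean_rounded + c(-1, 1) * qnorm(0.975) * se_rounded

# t-based interval with n - 1 degrees of freedom
t_crit &lt;- qt(0.975, df = n - 1)
ci_t   &lt;- mean_rounded + c(-1, 1) * t_crit * se_rounded

rbind(normal = ci_z, t = ci_t)</code></pre></div>
</div>
<p>The t-based interval is a little wider because the t quantile exceeds 1.96 at 39 degrees of freedom; the difference shrinks as the sample size grows.</p>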
</section>
</section>
<section id="conclusion" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="conclusion"><span class="header-section-number">6</span> Conclusion</h2>
<p>Standard deviation and standard error are often mentioned in the same breath, but they serve <strong>very different purposes</strong> in data analysis and statistical reasoning.</p>
<ul>
<li><strong>Standard deviation</strong> reflects the <strong>natural variability</strong> in a dataset. It tells us how different individuals are from one another.</li>
<li><strong>Standard error</strong> quantifies the <strong>precision of a sample estimate</strong>, such as the mean. It tells us how much we can trust our estimate of the population parameter.</li>
</ul>
<p>While they are mathematically related, confusing one for the other can lead to serious misinterpretations—especially in scientific communication, data journalism, or policymaking.</p>
<p>Here are some final takeaways:</p>
<ul>
<li>Use <strong>standard deviation</strong> when describing the data you have.</li>
<li>Use <strong>standard error</strong> when making inferences about the population from your sample.</li>
<li>Always <strong>label your charts and error bars clearly</strong>, and <strong>report sample size</strong> to give proper context.</li>
<li>Don’t mistake low standard error for low variability—it only means your estimate is more precise, not that your data is more uniform.</li>
</ul>
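<p>The last takeaway can be illustrated with a small simulation. The sketch below draws samples of increasing size from a hypothetical normal population (mean 2000, SD 450; these parameters are assumptions chosen for illustration, not the data used in this post) and reports both statistics for each sample.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(123)

# Hypothetical population parameters, chosen only for illustration
pop_mean &lt;- 2000
pop_sd   &lt;- 450

sample_sizes &lt;- c(40, 400, 4000)

# For each sample size, draw a sample and compute its SD and SE
results &lt;- t(sapply(sample_sizes, function(n) {
  x &lt;- rnorm(n, mean = pop_mean, sd = pop_sd)
  c(n = n, sd = sd(x), se = sd(x) / sqrt(n))
}))

round(results, 1)</code></pre></div>
</div>
<p>The <code>sd</code> column stays close to the population value at every sample size, while the <code>se</code> column shrinks roughly in proportion to the inverse square root of n.</p>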
<blockquote class="blockquote">
<p>🎯 In short:<br>
<strong>Standard deviation tells you about your data.</strong><br>
<strong>Standard error tells you how much you can trust your mean.</strong></p>
</blockquote>
<p>Understanding this distinction is more than just a statistical nuance—it’s a sign of analytical maturity.</p>
</section>
<section id="references" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="references"><span class="header-section-number">7</span> References</h2>
<ul>
<li><p>James, G., Witten, D., Hastie, T., &amp; Tibshirani, R. (2021). <em>An Introduction to Statistical Learning with Applications in R</em>. Springer. <a href="https://www.statlearning.com" class="uri">https://www.statlearning.com</a></p></li>
<li><p>Moore, D. S., McCabe, G. P., &amp; Craig, B. A. (2017). <em>Introduction to the Practice of Statistics</em>. W.H. Freeman.</p></li>
<li><p>R Core Team. (2024). <em>R: A language and environment for statistical computing</em>. R Foundation for Statistical Computing. <a href="https://www.r-project.org" class="uri">https://www.r-project.org</a></p></li>
<li><p>Wickham, H., Çetinkaya-Rundel, M., &amp; Grolemund, G. (2023). <em>R for Data Science (2e)</em>. <a href="https://r4ds.hadley.nz" class="uri">https://r4ds.hadley.nz</a></p></li>
<li><p>Navarro, D. (2019). <em>Learning Statistics with R: A tutorial for psychology students and other beginners</em>. <a href="https://learningstatisticswithr.com" class="uri">https://learningstatisticswithr.com</a></p></li>
</ul>


<!-- -->

</section>

 ]]></description>
  <category>R</category>
  <category>Statistics</category>
  <category>Data Analysis</category>
  <category>Standard Deviation</category>
  <category>Standard Error</category>
  <guid>https://mfatihtuzen.github.io/posts/2025-06-11_sd_vs_se/</guid>
  <pubDate>Fri, 11 Jul 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Correlation vs Causation: Understanding the Difference</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2025-06-04_correlation_vs_causation/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-06-04_correlation_vs_causation/corr_causation.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="900"></p>
</figure>
</div>
<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>“Correlation is not causation” – it’s a refrain we hear often, yet the distinction between these concepts is deceptively easy to overlook. Correlation refers to a statistical association: when one variable changes, another tends to change as well. Causation, on the other hand, means a change in one variable <em>directly produces</em> a change in another. In other words, there is a cause-and-effect relationship. A crucial insight (sometimes phrased as <em>“causation implies correlation (but not vice versa)”</em>) is that while causation <em>usually</em> shows up as some form of statistical association, observing a correlation by itself <strong>does not prove causation</strong>. This article will explore the theoretical basis of correlation and causation, illustrate the difference with real-world examples in economics and healthcare, and demonstrate with R code how misleading correlations can arise – and how we can attempt to control for confounding factors. Along the way, we’ll dispel common misconceptions and share expert insights to encourage critical thinking about causal claims in data.</p>
<blockquote class="blockquote">
<p><a href="https://blog.gopenai.com/disentangling-causation-and-correlation-in-data-analysis-bbb60a2e1dd2"><strong>Judea Pearl</strong></a>, a pioneer of modern causal inference, put it succinctly: <em>“Correlation is not causation; merely observing a relationship between variables does not imply a causal connection”.</em> In practical terms, correlation is a symmetric relationship – X and Y vary together – whereas causation is directional: X produces Y. If two things are correlated, there are several possibilities: X causes Y, Y causes X, or some other factor influences both (or it could even be a chance coincidence). As statistician <a href="https://blog.gopenai.com/disentangling-causation-and-correlation-in-data-analysis-bbb60a2e1dd2"><strong>David Freedman</strong></a> warned, <em>“Misinterpreting correlation as causation can lead to erroneous conclusions and misguided actions”</em>. To use data responsibly, we must dig deeper than surface-level associations.</p>
</blockquote>
</section>
<section id="theoretical-background-correlation-in-a-nutshell" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="theoretical-background-correlation-in-a-nutshell"><span class="header-section-number">2</span> Theoretical Background: Correlation in a Nutshell</h2>
<p>In statistics, correlation is often measured by the Pearson correlation coefficient (usually denoted <em>r</em>). Mathematically, for variables X and Y, this coefficient is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ar_%7BXY%7D%20=%20%5Cfrac%7B%5Cmathrm%7BCov%7D(X,Y)%7D%7B%5Csigma_X%20%5Csigma_Y%7D%0A"></p>
<p>where Cov(X,Y) is the covariance of X and Y, and σ<sub>X</sub> and σ<sub>Y</sub> are their standard deviations. This value ranges from –1 (perfect negative correlation) to +1 (perfect positive correlation). An <em>r</em> near 0 indicates little or no linear relationship, although a nonlinear relationship may still exist. Correlation captures how closely two variables move in sync. For example, if higher values of X tend to coincide with higher values of Y (and lower with lower), the correlation is positive. If one tends to go up when the other goes down, the correlation is negative. Crucially, correlation is a descriptive statistic – it quantifies an association, <strong>but it does not explain why</strong> the variables are related.</p>
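<p>As a quick check of the definition, the coefficient can be computed directly from the covariance and the two standard deviations and compared with R’s built-in <code>cor()</code>. The data below are simulated and the variable names are illustrative.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)

# Simulated data with a positive linear association
x &lt;- rnorm(100)
y &lt;- 0.5 * x + rnorm(100)

# Pearson's r from its definition: covariance over the product of SDs
r_manual &lt;- cov(x, y) / (sd(x) * sd(y))

# Built-in computation (Pearson is the default method)
r_builtin &lt;- cor(x, y)

all.equal(r_manual, r_builtin)   # TRUE</code></pre></div>
</div>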
<p><em>Correlation alone is silent on mechanism.</em> It answers “Are X and Y related?” not “Does X change Y?”. To establish causation, we usually rely on theory, controlled experiments, or advanced observational study designs. In the language of causality, we think about interventions: if we <em>do</em> something to X, does Y change as a result? This is fundamentally a different question than observing X and Y moving together. Empirically, evidence of causation typically requires satisfying conditions such as temporal precedence (cause precedes effect), a credible mechanism linking X to Y, consistency with other evidence, and ruling out alternative explanations (confounders).</p>
</section>
<section id="why-correlation-causation-confounders-coincidences-and-reverse-causality" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="why-correlation-causation-confounders-coincidences-and-reverse-causality"><span class="header-section-number">3</span> Why Correlation ≠ Causation: Confounders, Coincidences, and Reverse Causality</h2>
<p>If correlation doesn’t imply causation, what might be going on when two variables track together? There are a few common scenarios:</p>
<ul>
<li><p><strong>Confounding (Third Variables):</strong> A hidden factor influences both variables. This <em>lurking variable</em> makes X and Y move together, creating an illusion that they’re directly linked. Classic example: children’s shoe sizes are strongly correlated with their reading ability. Obviously, <em>bigger shoes don’t make kids read better</em>. The confounder is age: older children have larger feet and also read more proficiently – age drives both. Once age is taken into account, the shoe size–reading correlation disappears. As Pearl humorously noted, <em>“The third variable problem highlights the danger of assuming causation based solely on correlation”</em>. We’ll demonstrate a confounding example with R shortly.</p></li>
<li><p><strong>Pure Coincidence (Spurious Correlations):</strong> With enough data, you’re bound to find some weird correlations by chance alone. In fact, whole websites are devoted to absurd correlations. <strong>Tyler Vigen’s</strong> famous collection of spurious correlations highlights gems like a 0.95 correlation between U.S. per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets. It’s a comical reminder that with countless variables in the world, some will line up in sync purely by accident. High correlation can occur in entirely unrelated data — a cautionary tale for data miners. We should always ask: <em>Is there a plausible reason for this correlation, or could it be random?</em></p></li>
<li><p><strong>Reverse Causation (Directionality):</strong> Sometimes X and Y truly are causally related, but not in the direction one assumes. For example, suppose data show a correlation between depression and low vitamin D levels. Does lack of vitamin D cause depression, or do depressed individuals tend to get less sunlight and thus have lower vitamin D? The data alone can’t tell us the direction. Another example: cities with more police officers tend to have higher crime rates. This doesn’t mean police cause crime; rather, high-crime areas hire more police. In economic contexts, we’ll see debates like <em>“Do higher interest rates reduce inflation, or is it that rising inflation prompts central banks to raise rates?”</em> – in such cases, cause and effect can be easily confused if we only look at correlations.</p></li>
<li><p><strong>Selection Bias and Other Pitfalls:</strong> In observational data, how samples are collected can also create misleading correlations. For instance, a medical study might find that patients on a certain medication have higher survival rates – but if those patients were also healthier or younger on average (selection bias), the medication’s effect is confounded. <strong>Correlation can even vanish or flip sign when data is disaggregated</strong>, a phenomenon known as Simpson’s Paradox. The aggregate data might show one trend, while each subgroup shows the opposite trend. This often indicates a confounding variable at play.</p></li>
</ul>
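<p>The “pure coincidence” scenario is easy to reproduce. The following sketch generates many mutually independent noise series (the counts are arbitrary choices made for illustration) and looks for the strongest pairwise correlation among them.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(2025)

n_obs  &lt;- 10    # observations per series (for example, ten yearly values)
n_vars &lt;- 200   # number of unrelated series

# Every column is independent random noise
noise &lt;- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)

# All pairwise Pearson correlations
r &lt;- cor(noise)

# Strongest correlation among distinct pairs
max_abs_r &lt;- max(abs(r[upper.tri(r)]))
max_abs_r</code></pre></div>
</div>
<p>With 19,900 distinct pairs of short series, the largest absolute correlation usually exceeds 0.8 even though no series is related to any other. Searching across many variables without a prior hypothesis will therefore tend to turn up impressive-looking but meaningless correlations.</p>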
<section id="simulating-a-spurious-correlation-in-r" class="level3" data-number="3.1">
<h3 data-number="3.1" class="anchored" data-anchor-id="simulating-a-spurious-correlation-in-r"><span class="header-section-number">3.1</span> Simulating a Spurious Correlation in R</h3>
<p>To make these ideas concrete, let’s simulate an example in R. We’ll create a scenario with two groups (“Young” and “Old”) where <strong>within each group, there is no relationship</strong> between our variables, but when we combine the groups, we observe a strong correlation. This mimics a confounding situation (here, age group is the confounder).</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Simulate data for two groups: 'Young' and 'Old'</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb1-3">AgeGroup <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Young"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Old"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">each =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Within each group, foot_size and reading_score are generated independently (no true correlation)</span></span>
<span id="cb1-5">foot_size <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Young have smaller feet on average</span></span>
<span id="cb1-6">               <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Old have larger feet on average</span></span>
<span id="cb1-7">reading_score <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>),<span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Young have lower reading scores on avg</span></span>
<span id="cb1-8">                   <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">80</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))<span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Old have higher reading scores on avg</span></span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check correlations</span></span>
<span id="cb1-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cor</span>(foot_size, reading_score)                        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># overall correlation</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 0.7597618</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cor</span>(foot_size[AgeGroup<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Young"</span>], reading_score[AgeGroup<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Young"</span>])  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># within Young group</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 0.1043372</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cor</span>(foot_size[AgeGroup<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Old"</span>], reading_score[AgeGroup<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Old"</span>])      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># within Old group</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] -0.07429122</code></pre>
</div>
</div>
<p>Running the code above gives an overall Pearson correlation of about 0.76 between <code>foot_size</code> and <code>reading_score</code> for all 100 individuals combined. Yet within each age group separately, the correlation is near 0 (about 0.10 for the Young group and -0.07 for the Old group), which is essentially no relationship. In our simulation, <em>foot size</em> was not actually affecting <em>reading ability</em> at all; the apparent overall correlation arose because the Old group had higher values for both variables than the Young group. Age group was the lurking factor. This toy example is closely related to Simpson’s Paradox: aggregation masks what is happening within each group.</p>
<p>We can visualize this:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot the data, coloring by group, and add regression lines</span></span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb7-3">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(AgeGroup, foot_size, reading_score)</span>
<span id="cb7-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(df, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> foot_size, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> reading_score, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> AgeGroup)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lm"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb7-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lm"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb7-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Foot size (cm)"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Reading score"</span>,</span>
<span id="cb7-9">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Spurious correlation: Foot size vs Reading score"</span>,</span>
<span id="cb7-10">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">subtitle =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Colored by AgeGroup. Solid lines = separate group fits (no correlation); dashed line = overall fit."</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-06-04_correlation_vs_causation/index_files/figure-html/unnamed-chunk-2-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>Interpretation:</strong> <em>In the plot, blue points (younger kids) cluster toward the lower-left, and red points (older kids) cluster at the upper-right. The black dashed line through all data has a clear upward slope, indicating a positive overall correlation. However, the solid trend lines fitted to each group are nearly flat – within each group there’s no meaningful correlation between foot size and reading skill. It’s the group difference (older children are both larger and more literate) that created the misleading overall association.</em> This example underscores why we must be cautious: if we naively observed all the data, we might have (laughably) concluded that “big feet cause better reading”! Only by accounting for the confounding variable (age) do we see the true picture.</p>
<p><em>Figure: An example of spurious correlation. Each point is an individual child; foot size and reading score are uncorrelated within the Young (blue) and Old (red) groups, but when pooled together there is a strong positive correlation. The overall trend (black dashed line) is entirely driven by the age-group effect. Such patterns illustrate how a lurking variable can create a misleading correlation.</em></p>
</section>
</section>
<section id="real-world-example-1-interest-rates-and-inflation" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="real-world-example-1-interest-rates-and-inflation"><span class="header-section-number">4</span> Real-World Example 1: Interest Rates and Inflation</h2>
<p>One arena where correlation vs causation debates rage is macroeconomics. Consider <strong>interest rates and inflation</strong> – two metrics that often move in tandem. Central banks (like the U.S. Federal Reserve or Bank of England) adjust interest rates as a policy tool, aiming to control inflation. Intuition says raising interest rates should <em>cause</em> inflation to decrease (by cooling off spending and investment). Indeed, periods of tight monetary policy often coincide with inflation coming down. But does that correlation mean the rate hikes caused the relief in inflation? Not necessarily. As one economics blogger noted during the 2022–2023 inflation surge: <em>“One might be tempted to draw a direct line between higher interest rates and lower inflation rates. But correlation does not necessarily imply causation.”</em> In that episode, global inflation started easing after its 2022 peak, at the same time central banks were aggressively raising rates. However, careful analysis suggested much of the inflation decline was due to resolving supply chain issues and falling commodity prices – factors largely independent of interest rate moves. In other words, inflation <em>would</em> have started abating on its own as pandemic-era supply shocks faded, even if interest rates had not been hiked so sharply. The overlap in timing was a correlation, not a definitive proof of causation.</p>
<p>Economists have to untangle these relationships with statistical tools and historical data. One approach is to look at <em>lead-lag</em> relationships: if interest rate changes truly cause lower inflation, we’d expect to see inflation consistently drop a few quarters <strong>after</strong> rate hikes. If instead we observe that inflation spikes often <em>precede</em> rate hikes (as central banks react <em>to</em> rising inflation), that indicates reverse causation – inflation causing interest rate changes. Studies of the UK economy, for instance, found that in the short run, raising interest rates sometimes correlated with <strong>higher</strong> inflation in subsequent quarters. This counter-intuitive positive correlation could mean that initial rate hikes were implemented when inflation was already rising (so inflation kept climbing shortly after), or that rate hikes had supply-side effects (e.g.&nbsp;raising business costs) that temporarily <em>stoked</em> inflation. Only after a longer time lag did the correlation turn negative as expected (inflation easing modestly) – and even then, the effect was statistically weak in some analyses. An outside observer summed up the mixed evidence wryly: <em>“If correlation means causality then possibly not. [Rate hikes] may have an effect, but the effect might be weak on inflation and brutal on society”</em>. In other words, simply correlating past interest increases with inflation outcomes can be misleading; it takes careful modeling to isolate the causal impact (and it might be smaller than popularly assumed).</p>
<p>This example highlights two key points: First, <strong>directionality matters</strong> – are we seeing X→Y or Y→X or both? (In economics, feedback loops are common: inflation could prompt rate changes, which in turn influence future inflation, a two-way causality.) Second, <strong>confounding variables abound</strong> – other factors like global supply conditions, fiscal policy, or consumer expectations can drive inflation, obscuring the effect of interest rates. Analysts tackle these challenges with techniques such as Vector Autoregression (VAR) models, instrumental variables, or by “clustering” data to compare similar periods or countries. A commenter on an economics forum pointed out that failing to control for such factors is akin to falling for Simpson’s Paradox: <em>“Plotting inflation vs interest rates can be misleading unless you cluster to avoid confounding variables”</em>. The lesson: even in highly data-driven fields like economics, correlation alone can support multiple stories, and solid conclusions require digging into the causal structure of the problem.</p>
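<p>The lead-lag idea discussed in this section can be sketched in a few lines of R. This is only a minimal illustration on simulated data, not an analysis of real interest-rate or inflation series: the names <code>rate</code>, <code>inflation</code> and <code>lagged_cor()</code>, the three-period delay, and the effect size are all assumptions invented for the example.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(42)
n &lt;- 200

# Hypothetical policy-rate series with some persistence (AR(1))
rate &lt;- as.numeric(arima.sim(model = list(ar = 0.5), n = n))

# Assumption for illustration: inflation reacts negatively to the rate
# that was set three periods earlier, plus noise
true_lag &lt;- 3
inflation &lt;- c(rep(NA, true_lag), -0.8 * rate[1:(n - true_lag)]) +
  rnorm(n, sd = 0.5)

# Correlation between x at time t and y at time t + k
lagged_cor &lt;- function(x, y, k) {
  idx &lt;- seq_len(length(x) - k)
  cor(x[idx], y[idx + k], use = "complete.obs")
}

lags &lt;- 0:6
cors &lt;- sapply(lags, function(k) lagged_cor(rate, inflation, k))
data.frame(lag = lags, correlation = round(cors, 2))</code></pre></div>
<p>In this simulation the most negative correlation should appear at the lag that was built into the data. Real series are rarely this clean, and a lagged correlation is still only a correlation: it cannot by itself rule out reverse causation or a shared driver.</p>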
</section>
<section id="real-world-example-2-vaccines-and-disease-prevalence" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="real-world-example-2-vaccines-and-disease-prevalence"><span class="header-section-number">5</span> Real-World Example 2: Vaccines and Disease Prevalence</h2>
<p>Few areas demonstrate the difference between correlation and causation as starkly as healthcare and epidemiology. Let’s examine vaccines and disease rates. Vaccines are designed based on a known causal mechanism (they induce immunity, which <strong>prevents disease</strong>), and countless studies and trials have validated their efficacy. Thus, when a vaccine is introduced, we expect disease incidence to drop as a causal result. Conversely, if vaccination rates fall, diseases can surge. Both correlations have been observed in reality – one led to a life-saving public health success, the other to a dangerous resurgence of disease – and they underline why understanding causality is critical.</p>
<p><strong>Correlation used as evidence of causation (correctly):</strong> In the 1950s, polio was a dreaded disease paralyzing tens of thousands each year. In 1955, the Salk polio vaccine was introduced. Within just a few years, polio cases plummeted. In the United States, annual polio cases dropped from ~58,000 to about 5,600 by 1957, and only 161 cases by 1961. The timing and magnitude of this drop, alongside laboratory and clinical evidence, provided convincing proof that the vaccine <em>caused</em> the decline in polio. Here the correlation (vaccine rollout followed by disease collapse) was no coincidence – it was a predicted outcome based on a causal understanding of immunity. As another example, when the HPV vaccine was introduced, health officials observed sharp declines in HPV infections and related cancers in subsequent years, consistent with the expected causal effect of vaccinating adolescents. In such cases, correlation was a <strong>strong hint</strong> that led scientists to conclude causation, bolstered by controlled trials and biological plausibility.</p>
<p><strong>Correlation misinterpreted as causation (incorrectly):</strong> Not all observed links are what they seem. A notorious case was the now-debunked claim that the MMR (measles, mumps, rubella) vaccine caused autism in children. This idea stemmed from a 1998 study (later found fraudulent) and the anecdotal observation that autism diagnoses were often made around the same age at which children receive several vaccines. In truth, the <strong>apparent correlation</strong> was driven by coincidental timing and increased awareness/diagnosis of autism in the 1990s – not by vaccines. Extensive research over decades showed <em>no causal link</em>: <em>“Research over the past 15 years has shown that childhood vaccines don’t cause autism”.</em> Unfortunately, the fear incited by the false correlation led many parents to avoid the MMR vaccine. The result? Measles, once near-eliminated, came roaring back. Great Britain, for instance, experienced a measles epidemic in 2013, years after vaccination rates had fallen. Public health officials directly attributed this to the drop in vaccinations after the autism scare: <em>“Great Britain is in the midst of a measles epidemic, one that public health officials say is the result of parents refusing to vaccinate their children after a safety scare that was later proved to be fraudulent”</em>. In regions where MMR vaccination rates fell below about 80%, measles cases spiked dramatically. One commentator lamented, <em>“This is the legacy of the Wakefield scare”.</em> The correlation here – <strong>lower</strong> vaccination accompanied by <strong>higher</strong> disease incidence – reflected a causal relationship, but in the <strong>opposite direction</strong> to the original false claim. Vaccines <em>prevent</em> measles, so when vaccination dropped, measles returned. It’s a sobering example of how a misunderstood correlation (vaccines and autism) led to behaviors that revealed a very real causation (lack of vaccines causing disease outbreaks).</p>
<p>In summary, the vaccine story teaches us that <strong>we must have external evidence and domain knowledge to distinguish meaningful correlations from spurious ones</strong>. When strong theory and additional evidence support a correlation (as with vaccines preventing disease), we can infer causation with confidence. But when a correlation flies in the face of established knowledge or lacks a plausible mechanism (as with vaccines causing autism), it demands deep skepticism and further investigation. Correlation may open the door to a hypothesis, but only rigorous science can confirm causality.</p>
</section>
<section id="from-correlation-to-causation-how-can-we-tell" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="from-correlation-to-causation-how-can-we-tell"><span class="header-section-number">6</span> From Correlation to Causation: How Can We Tell?</h2>
<p>So, if correlation alone isn’t enough, how do scientists and statisticians actually establish causation? This is the realm of <em>causal inference</em>, and entire textbooks (and careers) are devoted to it. Here we’ll outline a few key principles and methods:</p>
<ul>
<li><p><strong>Controlled Experiments:</strong> The gold standard for testing causality is the randomized controlled trial (RCT). By randomly assigning subjects to a treatment (X) or control, we ensure no systematic confounders differ between groups. Any difference in outcomes (Y) can then be attributed to X (within known statistical error). As statistician <a href="https://blog.gopenai.com/disentangling-causation-and-correlation-in-data-analysis-bbb60a2e1dd2"><strong>Paul Rosenbaum</strong></a> emphasizes, <em>“Experimental design is crucial for establishing causal relationships and overcoming confounding factors”</em>. In fields like medicine, RCTs are required to claim a drug <em>causes</em> an effect. In more complex domains (economics, social sciences) where RCTs may be infeasible or unethical, researchers look for natural experiments or instrumental variables to approximate that level of control.</p></li>
<li><p><strong>Temporal Checks:</strong> Ensure the cause precedes the effect. Sounds obvious, but it’s a simple way to weed out some mistaken causal interpretations. If Y happens before X, X cannot be the cause. Sometimes lagged correlations or time-series analyses (like <strong>Granger causality</strong> tests in economics) are used to see if changes in X consistently come before changes in Y. In our interest rate example, analysts examined whether inflation tended to drop <em>after</em> interest rate hikes (and found mixed results, indicating caution in the causal claim).</p></li>
<li><p><strong>Controlling for Confounders:</strong> In observational studies, a common strategy is to measure possible confounding variables and include them in a regression or stratify the analysis. For instance, if we suspect age is a confounder in our earlier example, we can compare individuals of similar age (or include age in a multiple regression model) to see if foot size still correlates with reading ability within those strata. If the correlation vanishes after controlling for the third variable, it was likely spurious. Techniques like <strong>multiple regression</strong>, <strong>matching</strong>, <strong>propensity score adjustment</strong>, and <strong>difference-in-differences</strong> analysis are all about simulating a “ceteris paribus” condition – i.e.&nbsp;comparing like with like, so that the effect of interest can be isolated. In R, one might use <code>lm()</code> (linear modeling) to adjust for confounders. For example, <code>lm(reading_score ~ foot_size + Age, data=df)</code> would tell us if foot_size still has any predictive power for reading_score once Age is accounted for. (In our simulated data, it would show foot_size is not significant when Age is included, reinforcing that foot size itself wasn’t causing better reading.)</p></li>
<li><p><strong>Multiple Studies and Triangulation:</strong> We gain confidence in causation when multiple independent studies, using different methods, consistently point to the same conclusion. If correlational evidence is supported by lab experiments, longitudinal studies, and perhaps natural experiments, the case for causality strengthens. In the smoking and lung cancer example: early on, skeptics said “correlation is not causation” – maybe smokers had other habits causing cancer. But over time, mountains of evidence (animal experiments, biological mechanisms, epidemiological studies controlling for diet, etc.) converged to establish that smoking <em>does</em> cause cancer.</p></li>
<li><p><strong>Plausibility and Mechanism:</strong> A correlation accompanied by a plausible mechanism is more convincing. If we can explain <em>how</em> X could influence Y (through physics, biology, or logic), we are more likely to consider X a potential cause of Y. In contrast, if no one can conceive a realistic way that X would affect Y, we suspect a lurking variable or coincidence. (For instance, it’s hard to imagine how eating more cheese would directly cause strangulation by bedsheets – more likely, as one humorous analysis noted, it’s just an “accidental, misleading pattern” or related to a confounder like time or lifestyle).</p></li>
<li><p><strong>Causal Graphs and Models:</strong> Modern data science sometimes employs causal graphs or Bayesian networks (à la Judea Pearl’s <strong>do-calculus</strong>) to formally model assumptions about causation and test if the observed correlations fit a causal structure. While beyond the scope of this article, these tools provide a framework to encode “X causes Y” assumptions and see what observational patterns should emerge if that’s true. They also help identify what additional data or experiments are needed to distinguish between competing causal hypotheses.</p></li>
</ul>
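<p>The confounder-adjustment step from the list above can be sketched as follows. The data are simulated on the spot, and the column names (<code>age</code>, <code>foot_size</code>, <code>reading_score</code>) and coefficients are assumptions chosen for illustration; they are not necessarily the ones used in the earlier simulation.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(123)
n &lt;- 300

# Hypothetical data: age drives both foot size and reading score;
# foot size has no direct effect on reading
age &lt;- runif(n, min = 6, max = 12)
foot_size &lt;- 12 + 1.0 * age + rnorm(n, sd = 1)
reading_score &lt;- 20 + 6 * age + rnorm(n, sd = 5)
df &lt;- data.frame(age, foot_size, reading_score)

# Naive model: foot size alone looks like a strong predictor
naive &lt;- lm(reading_score ~ foot_size, data = df)

# Adjusted model: the foot-size coefficient is estimated holding age fixed
adjusted &lt;- lm(reading_score ~ foot_size + age, data = df)

coef(summary(naive))["foot_size", ]
coef(summary(adjusted))["foot_size", ]</code></pre></div>
<p>Because the simulation gives foot size no direct effect, its coefficient should shrink toward zero once <code>age</code> enters the model. With real data the same comparison is only as good as the set of confounders that were actually measured.</p>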
<p>In practice, determining causation is often like solving a puzzle. We marshal all available evidence, use critical thinking, and sometimes still end up with uncertainty. However, the effort is worthwhile because acting on false causal assumptions can be costly. Misattributing causation can lead to bad policy, ineffective or harmful interventions, or simply wasting resources chasing the wrong problem. As we’ve seen, <strong>data should be approached with a skeptical eye</strong>. Correlations can be tantalizing – they can indeed be hints to causal relationships – but we must verify those hints. By combining statistical rigor with domain expertise, we improve our chances of getting the causation right.</p>
</section>
<section id="conclusion" class="level2" data-number="7">
<h2 data-number="7" class="anchored" data-anchor-id="conclusion"><span class="header-section-number">7</span> Conclusion</h2>
<p>Understanding the difference between correlation and causation is essential for anyone who consumes data-driven information (which these days is all of us). We’ve covered how correlation is a mathematical relationship that can flag interesting connections but can also mislead us through confounding, coincidence, or reversed cause-and-effect. We explored examples from economics and healthcare where these distinctions have real-world consequences – from guiding central bank policies to informing public health decisions. The key takeaway is to <strong>think critically</strong>: when you hear that X is linked to Y, ask <em>why</em> and <em>how</em>. Look for evidence that goes beyond the raw correlation. As the saying usually attributed to Edward Tufte goes, <em>“Correlation is not causation, but it sure is a hint.”</em> Use the hint to investigate further, not to jump to conclusions.</p>
<p>In the words of statistician David Freedman, misinterpreting correlation as causation is not just an academic error but one that can lead to “misguided actions”. By staying curious and skeptical – and by leveraging tools like R to analyze data properly – we can uncover the true stories our data are telling us. Correlation can open the door to discovery, but only rigorous analysis and critical thinking will reveal what’s inside.</p>
</section>
<section id="references-further-reading" class="level2" data-number="8">
<h2 data-number="8" class="anchored" data-anchor-id="references-further-reading"><span class="header-section-number">8</span> References &amp; Further Reading</h2>
<ul>
<li><p><a href="https://macro-to-go.com/2023/06/05/correlation-does-not-imply-causation-the-case-of-higher-interest-and-lower-inflation-rates/#:~:text=As%20the%20inflation%20peak%20roughly,%E2%80%93%20with%20one%20important%20caveat">Bandholz, H. “Correlation does not imply causation: higher interest and lower inflation rates” – Macro to Go blog (2023)</a></p></li>
<li><p><a href="https://medium.com/@eddyojb/do-interest-rates-tackle-inflation-af4bfce45b88">Eddy-OJB. <em>“Do Interest Rates Tackle Inflation?”</em> – <em>Medium</em> (2023)</a></p></li>
<li><p><a href="https://www.statsig.com/perspectives/misleading-correlations-examples#:~:text=Tyler%20Vigen%27s%20site%2C%20Spurious%20Correlations%2C,data%20with%20a%20critical%20eye">Statsig Team. Examples of misleading correlations (2024)</a></p></li>
<li><p><a href="https://priceonomics.com/do-storks-deliver-babies/#:~:text=Lo%20and%20behold%2C%20a%20correlation,of%20storks%20in%20a%20country">Matthews, R. <em>“Storks Deliver Babies (p = 0.008)”</em> – <em>Teaching Statistics</em> via Priceonomics</a></p></li>
<li><p><a href="https://www.scribbr.com/methodology/correlation-vs-causation/#:~:text=Causation%20means%20that%20changes%20in,a%20causal%20link%20between%20them">Scribbr – <em>Correlation vs Causation</em> (2021)</a></p></li>
<li><p><a href="https://www.npr.org/sections/health-shots/2013/05/21/185801259/fifteen-years-after-a-vaccine-scare-a-measles-epidemic#:~:text=Most%20of%20the%20measles%20cases,the%20Wakefield%20paper%20was%20published">NPR – <em>Measles epidemic after vaccine scare</em> (2013)</a></p></li>
<li><p><a href="https://www.who.int/news-room/spotlight/history-of-vaccination/history-of-polio-vaccination#:~:text=The%20results%20were%20announced%20on,1961%2C%20only%20161%20cases%20remained">WHO – <em>History of polio vaccination</em> (2021)</a></p></li>
</ul>



</section>

 ]]></description>
  <category>Statistics</category>
  <category>Correlation</category>
  <category>Causation</category>
  <category>Econometrics</category>
  <category>Causal Inference</category>
  <category>Public Health</category>
  <guid>https://mfatihtuzen.github.io/posts/2025-06-04_correlation_vs_causation/</guid>
  <pubDate>Wed, 04 Jun 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Explained vs. Predictive Power: R², Adjusted R², and Beyond</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2025-04-30_rsquared/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-04-30_rsquared/quote_box.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:8cm"></p>
</figure>
</div>
<section id="introduction" class="level1" data-number="1">
<h1 data-number="1"><span class="header-section-number">1</span> Introduction</h1>
<blockquote class="blockquote">
<p><strong>You trust R². Should you?</strong><br>
You proudly present a model with R² = 0.95. Everyone applauds.<br>
But what if your model fails miserably on new data?</p>
</blockquote>
<p>When building a statistical model, one of the first numbers analysts and data scientists often cite is the <strong>R²</strong>, or coefficient of determination. It’s widely reported in research, academic theses, and industry reports — and yet, frequently misunderstood or misused.</p>
<p>Does a high R² mean your model is good? Is it enough to evaluate model performance? What about its adjusted or predictive counterparts?</p>
<p>This article explores in depth:</p>
<ul>
<li>What R², Adjusted R², and Predicted R² actually mean</li>
<li>Why relying solely on R² can mislead you</li>
<li>How to evaluate models using <strong>both explanatory and predictive power</strong></li>
<li>A real-life implementation using the <strong>{tidymodels}</strong> framework in R</li>
</ul>
<p>We’ll also discuss best practices and common pitfalls, and equip you with a mindset to look beyond surface-level model summaries.</p>
</section>
<section id="theoretical-background" class="level1" data-number="2">
<h1 data-number="2"><span class="header-section-number">2</span> Theoretical Background</h1>
<section id="what-is-r²" class="level2" data-number="2.1">
<h2 data-number="2.1" class="anchored" data-anchor-id="what-is-r²"><span class="header-section-number">2.1</span> What is R²?</h2>
<p>The <strong>coefficient of determination</strong>, R², is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?R%5E2%20=%201%20-%20%5Cfrac%7B%5Ctext%7BSS%7D_%7B%5Ctext%7Bres%7D%7D%7D%7B%5Ctext%7BSS%7D_%7B%5Ctext%7Btot%7D%7D%7D"></p>
<p>Where:</p>
<ul>
<li><p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BSS%7D_%7B%5Ctext%7Bres%7D%7D"> = Sum of squares of residuals = <img src="https://latex.codecogs.com/png.latex?%5Csum%20(y_i%20-%20%5Chat%7By%7D_i)%5E2"></p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BSS%7D_%7B%5Ctext%7Btot%7D%7D"> = Total sum of squares = <img src="https://latex.codecogs.com/png.latex?%5Csum%20(y_i%20-%20%5Cbar%7By%7D)%5E2"></p></li>
</ul>
<p>It tells us the <strong>proportion of variance explained by the model</strong>. An R² of 0.80 implies that 80% of the variability in the dependent variable is explained by the model.</p>
<p>But beware — it only measures <strong>fit to training data</strong>, not the model’s ability to <strong>generalize</strong>.</p>
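<p>To make the formula concrete, here is a minimal sketch that computes R² by hand and compares it with the value reported by <code>summary()</code>. It uses the Boston housing data from the MASS package; the choice of <code>rm</code> and <code>lstat</code> as predictors is an arbitrary assumption for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Fit a simple linear model on the Boston data (ships with MASS)
boston &lt;- MASS::Boston
fit &lt;- lm(medv ~ rm + lstat, data = boston)

ss_res &lt;- sum(residuals(fit)^2)                      # residual sum of squares
ss_tot &lt;- sum((boston$medv - mean(boston$medv))^2)   # total sum of squares
r2_manual &lt;- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)</code></pre></div>
<p>Both sums of squares are computed on the same rows the model was fitted to, which is exactly why this number says nothing about new data.</p>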
</section>
<section id="adjusted-r²" class="level2" data-number="2.2">
<h2 data-number="2.2" class="anchored" data-anchor-id="adjusted-r²"><span class="header-section-number">2.2</span> Adjusted R²</h2>
<p>When we add predictors to a regression model, R² will never decrease — even if the added variables are irrelevant.</p>
<p><strong>Adjusted R²</strong> corrects this by penalizing the number of predictors: <img src="https://latex.codecogs.com/png.latex?R%5E2_%7B%5Ctext%7Badj%7D%7D%20=%201%20-%20%5Cleft(1%20-%20R%5E2%5Cright)%20%5Ccdot%20%5Cleft(%5Cfrac%7Bn%20-%201%7D%7Bn%20-%20p%20-%201%7D%5Cright)"></p>
<p>Where:</p>
<ul>
<li><p>n : number of observations</p></li>
<li><p>p : number of predictors</p></li>
</ul>
<p>Thus, Adjusted R² will <strong>only increase</strong> if the new predictor improves the model more than expected by chance.</p>
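<p>The penalty can be checked directly. The sketch below adds a pure-noise column to the Boston data and compares R² with Adjusted R² before and after; the helper <code>adj_r2()</code>, the <code>noise</code> column, and the choice of predictors are hypothetical and exist only for this illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(1)
boston &lt;- MASS::Boston
boston$noise &lt;- rnorm(nrow(boston))   # irrelevant predictor, made up for this sketch

# Adjusted R-squared from the formula above
adj_r2 &lt;- function(fit) {
  n &lt;- nobs(fit)
  p &lt;- length(coef(fit)) - 1          # predictors, excluding the intercept
  r2 &lt;- summary(fit)$r.squared
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

fit_base  &lt;- lm(medv ~ rm + lstat, data = boston)
fit_noise &lt;- lm(medv ~ rm + lstat + noise, data = boston)

c(r2_base = summary(fit_base)$r.squared, r2_noise = summary(fit_noise)$r.squared)
c(adj_base = adj_r2(fit_base), adj_noise = adj_r2(fit_noise))</code></pre></div>
<p>Plain R² cannot go down when the noise column is added. Adjusted R² rises only if the new coefficient's t-statistic exceeds 1 in absolute value, so a pure-noise predictor will usually lower it.</p>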
</section>
<section id="predicted-r²" class="level2" data-number="2.3">
<h2 data-number="2.3" class="anchored" data-anchor-id="predicted-r²"><span class="header-section-number">2.3</span> Predicted R²</h2>
<p><strong>Predicted R²</strong> (or cross-validated R²) is usually a more honest estimate of model utility than in-sample R². It answers the question:</p>
<blockquote class="blockquote">
<p><em>How well will this model predict new, unseen data?</em></p>
</blockquote>
<p>This is typically calculated using cross-validation, and unlike regular R², it reflects <strong>out-of-sample performance</strong>.</p>
<p>You can also view it as:</p>
<p><img src="https://latex.codecogs.com/png.latex?R%5E2_%7B%5Ctext%7Bpred%7D%7D%20=%201%20-%20%5Cfrac%7B%5Ctext%7BPRESS%7D%7D%7B%5Ctext%7BSS%7D_%7B%5Ctext%7Btot%7D%7D%7D"></p>
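<p>Here PRESS is the <strong>Prediction Error Sum of Squares</strong>: the sum of squared prediction errors obtained when each observation is predicted by a model fitted without it (leave-one-out cross-validation).</p>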
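<p>For an ordinary least-squares model, PRESS does not require refitting the model once per observation. A minimal sketch, assuming a linear model fitted with <code>lm()</code> on the Boston data (the predictors are again an arbitrary choice): each leave-one-out prediction error equals the ordinary residual divided by one minus the observation's leverage.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">boston &lt;- MASS::Boston
fit &lt;- lm(medv ~ rm + lstat, data = boston)

# Leave-one-out prediction errors without refitting:
# e_i / (1 - h_ii), where h_ii are the leverages (hat values)
loo_resid &lt;- residuals(fit) / (1 - hatvalues(fit))
press &lt;- sum(loo_resid^2)

ss_tot &lt;- sum((boston$medv - mean(boston$medv))^2)
pred_r2 &lt;- 1 - press / ss_tot

c(r2 = summary(fit)$r.squared, predicted_r2 = pred_r2)</code></pre></div>
<p>Since every leverage lies between 0 and 1, PRESS is never smaller than the residual sum of squares, so Predicted R² cannot exceed the ordinary R². This shortcut applies to linear least-squares fits; for other model types, explicit resampling is the usual route.</p>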
</section>
</section>
<section id="dataset-overview" class="level1" data-number="3">
<h1 data-number="3"><span class="header-section-number">3</span> Dataset Overview</h1>
<p>We’ll use the classic <strong>Boston Housing Dataset</strong> to demonstrate. It includes:</p>
<ul>
<li><p>Socio-economic and housing variables for 506 census tracts in the Boston area</p></li>
<li><p>Target: <code>medv</code> (median value of owner-occupied homes in $1000s)</p></li>
</ul>
<p>Below are the key variables:</p>
<ul>
<li><strong>crim</strong>: per capita crime rate by town</li>
<li><strong>zn</strong>: proportion of residential land zoned for large lots</li>
<li><strong>indus</strong>: proportion of non-retail business acres</li>
<li><strong>chas</strong>: Charles River dummy variable (1 = tract bounds river; 0 = otherwise)</li>
<li><strong>nox</strong>: nitric oxides concentration (parts per 10 million)</li>
<li><strong>rm</strong>: average number of rooms per dwelling</li>
<li><strong>age</strong>: proportion of owner-occupied units built before 1940</li>
<li><strong>dis</strong>: weighted distance to employment centers</li>
<li><strong>rad</strong>: index of accessibility to radial highways</li>
<li><strong>tax</strong>: property-tax rate per $10,000</li>
<li><strong>ptratio</strong>: pupil-teacher ratio by town</li>
<li><strong>black</strong>: 1000(Bk - 0.63)^2 where Bk is the proportion of Black residents</li>
<li><strong>lstat</strong>: percentage of lower status of the population</li>
<li><strong>medv</strong>: <strong>target</strong> — median value of owner-occupied homes (in $1000s)</li>
</ul>
<p>This regression problem mimics common real estate or socio-economic modeling use cases. Let’s first examine the dataset’s summary statistics.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidymodels)</span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(MASS)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(corrr)</span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(skimr)</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(patchwork)</span>
<span id="cb1-7"></span>
<span id="cb1-8"></span>
<span id="cb1-9">boston <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> MASS<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>Boston</span>
<span id="cb1-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">skim</span>(boston)</span></code></pre></div></div>
<div class="cell-output-display">
<table class="caption-top table table-sm table-striped small">
<caption>Data summary</caption>
<tbody>
<tr class="odd">
<td style="text-align: left;">Name</td>
<td style="text-align: left;">boston</td>
</tr>
<tr class="even">
<td style="text-align: left;">Number of rows</td>
<td style="text-align: left;">506</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Number of columns</td>
<td style="text-align: left;">14</td>
</tr>
<tr class="even">
<td style="text-align: left;">_______________________</td>
<td style="text-align: left;"></td>
</tr>
<tr class="odd">
<td style="text-align: left;">Column type frequency:</td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;">numeric</td>
<td style="text-align: left;">14</td>
</tr>
<tr class="odd">
<td style="text-align: left;">________________________</td>
<td style="text-align: left;"></td>
</tr>
<tr class="even">
<td style="text-align: left;">Group variables</td>
<td style="text-align: left;">None</td>
</tr>
</tbody>
</table>
<p><strong>Variable type: numeric</strong></p>
<table class="caption-top table table-sm table-striped small">
<colgroup>
<col style="width: 15%">
<col style="width: 10%">
<col style="width: 15%">
<col style="width: 7%">
<col style="width: 7%">
<col style="width: 7%">
<col style="width: 7%">
<col style="width: 7%">
<col style="width: 7%">
<col style="width: 7%">
<col style="width: 6%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">skim_variable</th>
<th style="text-align: right;">n_missing</th>
<th style="text-align: right;">complete_rate</th>
<th style="text-align: right;">mean</th>
<th style="text-align: right;">sd</th>
<th style="text-align: right;">p0</th>
<th style="text-align: right;">p25</th>
<th style="text-align: right;">p50</th>
<th style="text-align: right;">p75</th>
<th style="text-align: right;">p100</th>
<th style="text-align: left;">hist</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">crim</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">3.61</td>
<td style="text-align: right;">8.60</td>
<td style="text-align: right;">0.01</td>
<td style="text-align: right;">0.08</td>
<td style="text-align: right;">0.26</td>
<td style="text-align: right;">3.68</td>
<td style="text-align: right;">88.98</td>
<td style="text-align: left;">▇▁▁▁▁</td>
</tr>
<tr class="even">
<td style="text-align: left;">zn</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">11.36</td>
<td style="text-align: right;">23.32</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">12.50</td>
<td style="text-align: right;">100.00</td>
<td style="text-align: left;">▇▁▁▁▁</td>
</tr>
<tr class="odd">
<td style="text-align: left;">indus</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">11.14</td>
<td style="text-align: right;">6.86</td>
<td style="text-align: right;">0.46</td>
<td style="text-align: right;">5.19</td>
<td style="text-align: right;">9.69</td>
<td style="text-align: right;">18.10</td>
<td style="text-align: right;">27.74</td>
<td style="text-align: left;">▇▆▁▇▁</td>
</tr>
<tr class="even">
<td style="text-align: left;">chas</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">0.07</td>
<td style="text-align: right;">0.25</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">0.00</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: left;">▇▁▁▁▁</td>
</tr>
<tr class="odd">
<td style="text-align: left;">nox</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">0.55</td>
<td style="text-align: right;">0.12</td>
<td style="text-align: right;">0.38</td>
<td style="text-align: right;">0.45</td>
<td style="text-align: right;">0.54</td>
<td style="text-align: right;">0.62</td>
<td style="text-align: right;">0.87</td>
<td style="text-align: left;">▇▇▆▅▁</td>
</tr>
<tr class="even">
<td style="text-align: left;">rm</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">6.28</td>
<td style="text-align: right;">0.70</td>
<td style="text-align: right;">3.56</td>
<td style="text-align: right;">5.89</td>
<td style="text-align: right;">6.21</td>
<td style="text-align: right;">6.62</td>
<td style="text-align: right;">8.78</td>
<td style="text-align: left;">▁▂▇▂▁</td>
</tr>
<tr class="odd">
<td style="text-align: left;">age</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">68.57</td>
<td style="text-align: right;">28.15</td>
<td style="text-align: right;">2.90</td>
<td style="text-align: right;">45.02</td>
<td style="text-align: right;">77.50</td>
<td style="text-align: right;">94.07</td>
<td style="text-align: right;">100.00</td>
<td style="text-align: left;">▂▂▂▃▇</td>
</tr>
<tr class="even">
<td style="text-align: left;">dis</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">3.80</td>
<td style="text-align: right;">2.11</td>
<td style="text-align: right;">1.13</td>
<td style="text-align: right;">2.10</td>
<td style="text-align: right;">3.21</td>
<td style="text-align: right;">5.19</td>
<td style="text-align: right;">12.13</td>
<td style="text-align: left;">▇▅▂▁▁</td>
</tr>
<tr class="odd">
<td style="text-align: left;">rad</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">9.55</td>
<td style="text-align: right;">8.71</td>
<td style="text-align: right;">1.00</td>
<td style="text-align: right;">4.00</td>
<td style="text-align: right;">5.00</td>
<td style="text-align: right;">24.00</td>
<td style="text-align: right;">24.00</td>
<td style="text-align: left;">▇▂▁▁▃</td>
</tr>
<tr class="even">
<td style="text-align: left;">tax</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">408.24</td>
<td style="text-align: right;">168.54</td>
<td style="text-align: right;">187.00</td>
<td style="text-align: right;">279.00</td>
<td style="text-align: right;">330.00</td>
<td style="text-align: right;">666.00</td>
<td style="text-align: right;">711.00</td>
<td style="text-align: left;">▇▇▃▁▇</td>
</tr>
<tr class="odd">
<td style="text-align: left;">ptratio</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">18.46</td>
<td style="text-align: right;">2.16</td>
<td style="text-align: right;">12.60</td>
<td style="text-align: right;">17.40</td>
<td style="text-align: right;">19.05</td>
<td style="text-align: right;">20.20</td>
<td style="text-align: right;">22.00</td>
<td style="text-align: left;">▁▃▅▅▇</td>
</tr>
<tr class="even">
<td style="text-align: left;">black</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">356.67</td>
<td style="text-align: right;">91.29</td>
<td style="text-align: right;">0.32</td>
<td style="text-align: right;">375.38</td>
<td style="text-align: right;">391.44</td>
<td style="text-align: right;">396.22</td>
<td style="text-align: right;">396.90</td>
<td style="text-align: left;">▁▁▁▁▇</td>
</tr>
<tr class="odd">
<td style="text-align: left;">lstat</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">12.65</td>
<td style="text-align: right;">7.14</td>
<td style="text-align: right;">1.73</td>
<td style="text-align: right;">6.95</td>
<td style="text-align: right;">11.36</td>
<td style="text-align: right;">16.96</td>
<td style="text-align: right;">37.97</td>
<td style="text-align: left;">▇▇▅▂▁</td>
</tr>
<tr class="even">
<td style="text-align: left;">medv</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">22.53</td>
<td style="text-align: right;">9.20</td>
<td style="text-align: right;">5.00</td>
<td style="text-align: right;">17.02</td>
<td style="text-align: right;">21.20</td>
<td style="text-align: right;">25.00</td>
<td style="text-align: right;">50.00</td>
<td style="text-align: left;">▂▇▅▁▁</td>
</tr>
</tbody>
</table>
</div>
</div>
<p><strong>Commentary:</strong></p>
<ul>
<li><p><code>crim</code> is heavily right-skewed (mean 3.61 against a median of 0.26), <code>lstat</code> is mildly right-skewed, and <code>tax</code> is widely spread (187 to 711) with a second cluster of high values.</p></li>
<li><p><code>chas</code> is binary and acts like a categorical indicator.</p></li>
<li><p>The target variable <code>medv</code> ranges from $5,000 to $50,000 (capped).</p></li>
<li><p><code>rm</code> (average number of rooms) and <code>lstat</code> (percentage of lower-status population) vary meaningfully across observations and will likely play strong roles in the model.</p></li>
</ul>
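<p>The skewness remarks above can be checked numerically. The following is a minimal base R sketch, not part of the original analysis: it assumes that <code>boston</code> matches <code>MASS::Boston</code>, and the <code>skewness()</code> helper is a hypothetical moment-based definition written here for illustration.</p>

```r
# Assumption: `boston` in the post corresponds to MASS::Boston.
boston <- MASS::Boston

# Hypothetical helper: moment-based sample skewness (mean of cubed z-scores,
# using the population standard deviation).
skewness <- function(x) {
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
  mean(z^3)
}

vars <- c("crim", "tax", "lstat", "rm")
summary_tbl <- data.frame(
  variable = vars,
  mean     = sapply(vars, function(v) mean(boston[[v]])),
  median   = sapply(vars, function(v) median(boston[[v]])),
  skewness = sapply(vars, function(v) skewness(boston[[v]]))
)

# A mean well above the median together with a large positive skewness
# points to a long right tail.
print(summary_tbl, row.names = FALSE, digits = 3)
```

<p>Variables with a long right tail are the usual candidates for a log or similar transformation before linear modeling.</p>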
<p>Next, we examine correlations with <code>medv</code>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">boston <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">correlate</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> corrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">focus</span>(medv) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">arrange</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">desc</span>(medv))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 13 × 2
   term      medv
   &lt;chr&gt;    &lt;dbl&gt;
 1 rm       0.695
 2 zn       0.360
 3 black    0.333
 4 dis      0.250
 5 chas     0.175
 6 age     -0.377
 7 rad     -0.382
 8 crim    -0.388
 9 nox     -0.427
10 tax     -0.469
11 indus   -0.484
12 ptratio -0.508
13 lstat   -0.738</code></pre>
</div>
</div>
<p><strong>Interpretation of Correlations:</strong></p>
<ul>
<li><p><code>rm</code> shows a <strong>strong positive</strong> correlation with <code>medv</code> — more rooms generally imply higher value.</p></li>
<li><p><code>lstat</code> has a <strong>strong negative</strong> correlation (-0.738): as the share of lower-status population rises, housing values drop.</p></li>
<li><p><code>ptratio</code>, <code>indus</code>, <code>tax</code>, <code>nox</code>, <code>crim</code>, <code>rad</code>, and <code>age</code> show moderate negative correlations (between -0.38 and -0.51), hinting at socio-environmental effects.</p></li>
</ul>
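<p>As a cross-check, the same correlations can be reproduced without <code>corrr</code>. This is a minimal base R sketch that assumes <code>boston</code> matches <code>MASS::Boston</code>. The cut-offs used to label strength (0.3 and 0.6) are an illustrative convention chosen here, not a rule from the original analysis.</p>

```r
# Assumption: `boston` in the post corresponds to MASS::Boston.
boston <- MASS::Boston

# Pearson correlation of every predictor with medv, sorted in descending order.
r <- cor(boston)[, "medv"]
r <- sort(r[names(r) != "medv"], decreasing = TRUE)

# Illustrative strength labels based on the absolute correlation.
strength <- cut(abs(r), breaks = c(0, 0.3, 0.6, 1),
                labels = c("weak", "moderate", "strong"))

print(data.frame(term = names(r), medv = round(r, 3), strength = strength),
      row.names = FALSE)
```

<p>Pearson correlation only captures linear association, so a curved relationship can be stronger than its coefficient suggests.</p>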
<p>These insights will guide us in building and evaluating our model.</p>
</section>
<section id="exploratory-data-analysis" class="level1" data-number="4">
<h1 data-number="4"><span class="header-section-number">4</span> Exploratory Data Analysis</h1>
<p>Let’s visualize some of the most influential variables in relation to <code>medv</code>, our target variable. These exploratory graphs help reveal potential linear or nonlinear relationships, outliers, or the need for transformation.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define individual plots with improved formatting for Quarto rendering</span></span>
<span id="cb4-2">p1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(boston, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(rm, medv)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#2c7fb8"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lm"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Rooms</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">vs. Median Value"</span>,</span>
<span id="cb4-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Average Number of Rooms (rm)"</span>,</span>
<span id="cb4-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median Value of Homes ($1000s)"</span></span>
<span id="cb4-9">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lineheight =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.1</span>))</span>
<span id="cb4-12"></span>
<span id="cb4-13">p2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(boston, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(lstat, medv)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#de2d26"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loess"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-17">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Lower Status %</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">vs. Median Value"</span>,</span>
<span id="cb4-18">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"% Lower Status Population (lstat)"</span>,</span>
<span id="cb4-19">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median Value of Homes ($1000s)"</span></span>
<span id="cb4-20">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lineheight =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.1</span>))</span>
<span id="cb4-23"></span>
<span id="cb4-24">p3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(boston, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(nox, medv)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-25">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#31a354"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-26">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loess"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-27">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-28">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"NOx Concentration</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">vs. Median Value"</span>,</span>
<span id="cb4-29">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"NOx concentration (parts per 10 million)"</span>,</span>
<span id="cb4-30">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median Value of Homes ($1000s)"</span></span>
<span id="cb4-31">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-32">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-33">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lineheight =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.1</span>))</span>
<span id="cb4-34"></span>
<span id="cb4-35">p4 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(boston, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(age, medv)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-36">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ff7f00"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-37">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loess"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-38">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-39">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Old Homes %</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">vs. Median Value"</span>,</span>
<span id="cb4-40">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"% Homes Built Before 1940 (age)"</span>,</span>
<span id="cb4-41">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median Value of Homes ($1000s)"</span></span>
<span id="cb4-42">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-43">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-44">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lineheight =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.1</span>))</span>
<span id="cb4-45"></span>
<span id="cb4-46">p5 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(boston, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(tax, medv)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-47">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#6a3d9a"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-48">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loess"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-49">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-50">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Tax Rate</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">vs. Median Value"</span>,</span>
<span id="cb4-51">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Tax Rate (per $10,000)"</span>,</span>
<span id="cb4-52">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median Value of Homes ($1000s)"</span></span>
<span id="cb4-53">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-54">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-55">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lineheight =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.1</span>))</span>
<span id="cb4-56"></span>
<span id="cb4-57">p6 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(boston, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(dis, medv)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-58">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#1f78b4"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-59">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_smooth</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"loess"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">se =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"black"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-60">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb4-61">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Distance to Jobs</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">vs. Median Value"</span>,</span>
<span id="cb4-62">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Weighted Distance to Employment Centers (dis)"</span>,</span>
<span id="cb4-63">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Median Value of Homes ($1000s)"</span></span>
<span id="cb4-64">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-65">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-66">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lineheight =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.1</span>))</span></code></pre></div></div>
</div>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">(p1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> p2) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_layout</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">guides =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'collect'</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-04-30_rsquared/index_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid figure-img" width="1152"></p>
</figure>
</div>
</div>
</div>
<ul>
<li><strong>Rooms (<code>rm</code>)</strong>: Strong positive linear relationship with <code>medv</code>. More rooms correlate with higher home values.</li>
<li><strong>Lower Status Population (<code>lstat</code>)</strong>: Strong nonlinear inverse relation. Poorer areas tend to have significantly lower housing values.</li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">(p3 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> p4) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_layout</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">guides =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'collect'</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-04-30_rsquared/index_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid figure-img" width="1152"></p>
</figure>
</div>
</div>
</div>
<ul>
<li><strong>Nitric Oxide (<code>nox</code>)</strong>: Moderate negative relationship — environmental factors like pollution impact price.</li>
<li><strong>Old Homes (<code>age</code>)</strong>: Slight negative trend — older areas may have reduced appeal or value.</li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">(p5 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> p6) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_layout</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">guides =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'collect'</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-04-30_rsquared/index_files/figure-html/unnamed-chunk-6-1.png" class="img-fluid figure-img" width="1152"></p>
</figure>
</div>
</div>
</div>
<ul>
<li><strong>Tax Rate (<code>tax</code>)</strong>: Higher taxes often relate to lower housing value, possibly due to location or socio-economic constraints.</li>
<li><strong>Distance to Employment Centers (<code>dis</code>)</strong>: Weak to moderate positive correlation. Suburban or well-connected areas might command higher value.</li>
</ul>
<p>These six plots combine both socioeconomic and environmental dimensions of housing value — providing both intuition and modeling direction.</p>
</section>
<section id="modeling-with-tidymodels" class="level1" data-number="5">
<h1 data-number="5"><span class="header-section-number">5</span> Modeling with Tidymodels</h1>
<p>Now that we’ve explored the data, it’s time to fit a model using the <strong>tidymodels</strong> framework. We’ll fit an ordinary least squares linear regression that predicts <code>medv</code>, the median home value, from all of the remaining variables.</p>
<section id="data-splitting-and-preprocessing" class="level2" data-number="5.1">
<h2 data-number="5.1" class="anchored" data-anchor-id="data-splitting-and-preprocessing"><span class="header-section-number">5.1</span> Data Splitting and Preprocessing</h2>
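<p>We begin by splitting the dataset into training and testing sets. The training set will be used to fit the model, and the test set will evaluate its generalization performance. We then define a recipe (with no preprocessing steps yet), a linear model specification, and a workflow that bundles the two.</p>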
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb8-2">split <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">initial_split</span>(boston, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>)</span>
<span id="cb8-3">train <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">training</span>(split)</span>
<span id="cb8-4">test <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">testing</span>(split)</span>
<span id="cb8-5"></span>
<span id="cb8-6">rec <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">recipe</span>(medv <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train)</span>
<span id="cb8-7">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">linear_reg</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set_engine</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lm"</span>)</span>
<span id="cb8-8">workflow <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">workflow</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_recipe</span>(rec) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_model</span>(model)</span></code></pre></div></div>
</div>
</section>
<section id="model-fitting" class="level2" data-number="5.2">
<h2 data-number="5.2" class="anchored" data-anchor-id="model-fitting"><span class="header-section-number">5.2</span> Model Fitting</h2>
<p>We now fit the model to the training data:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">fit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(workflow, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train)</span></code></pre></div></div>
</div>
</section>
<section id="evaluating-the-model-on-the-training-set" class="level2" data-number="5.3">
<h2 data-number="5.3" class="anchored" data-anchor-id="evaluating-the-model-on-the-training-set"><span class="header-section-number">5.3</span> Evaluating the Model on the Training Set</h2>
<p>Let’s extract the R² and Adjusted R² values from the fitted model:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1">training_summary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glance</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">extract_fit_parsnip</span>(fit))</span>
<span id="cb10-2">training_summary <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(r.squared, adj.r.squared)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 1 × 2
  r.squared adj.r.squared
      &lt;dbl&gt;         &lt;dbl&gt;
1     0.726         0.717</code></pre>
</div>
</div>
<p><strong>🔍 Interpretation:</strong></p>
<ul>
<li><strong>R²</strong> measures the proportion of variance in <code>medv</code> explained by the predictors in the training set.</li>
<li><strong>Adjusted R²</strong> adjusts this value by penalizing for the number of predictors, making it more reliable in multi-variable contexts.</li>
</ul>
<p>If R² and Adjusted R² differ substantially, some predictors are probably not contributing meaningfully to the model. Here the gap is small (0.726 vs. 0.717), so the penalty for the number of predictors is modest.</p>
<blockquote class="blockquote">
<p>Example: A model with 12 predictors might show R² = 0.76, but Adjusted R² = 0.72 — suggesting some predictors are adding complexity without real explanatory power.</p>
</blockquote>
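<p>The penalty behind Adjusted R² can be checked by hand. The sketch below is a minimal base-R illustration rather than part of the original analysis: <code>adj_r2()</code> is a hypothetical helper, and the training-set size of 404 rows is an assumption (80% of the 506 rows in the standard Boston data), paired with 13 predictors.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r2 &lt;- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Assumed values: n = 404 training rows, p = 13 predictors
adj_r2(r2 = 0.726, n = 404, p = 13)</code></pre></div>
</div>
<p>Under those assumed values the result rounds to 0.717, in line with the Adjusted R² reported above. Holding R² fixed and raising <code>p</code> shows how quickly the penalty grows when the sample is small.</p>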
</section>
<section id="test-set-performance" class="level2" data-number="5.4">
<h2 data-number="5.4" class="anchored" data-anchor-id="test-set-performance"><span class="header-section-number">5.4</span> Test Set Performance</h2>
<p>Now we assess the model on the unseen test data:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1">preds <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(fit, test) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_cols</span>(test)</span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">metrics</span>(preds, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">truth =</span> medv, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">estimate =</span> .pred)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 3 × 3
  .metric .estimator .estimate
  &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
1 rmse    standard       4.79 
2 rsq     standard       0.784
3 mae     standard       3.32 </code></pre>
</div>
</div>
<p><strong>📉 Interpretation:</strong></p>
<ul>
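<li>If <strong>test R²</strong> is <strong>much lower</strong> than training R², overfitting may be present. Here the test R² (0.784) is slightly higher than the training R² (0.726), so this split shows no sign of overfitting.</li>
<li><strong>RMSE</strong> is expressed in the units of <code>medv</code> ($1000s), so a test RMSE of 4.79 corresponds to a typical prediction error of roughly $4,800. A test RMSE far above the training RMSE would be another sign of poor generalization.</li>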
</ul>
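<p>One detail is worth keeping in mind when reading this table: yardstick’s <code>rsq</code> metric is the squared correlation between truth and estimate, while <code>rsq_trad</code> uses the traditional 1 − SSE/SST form. The base-R sketch below uses made-up toy vectors and hypothetical helper names to show how the two definitions can disagree; it is an illustration only, not output from the Boston model.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helpers mirroring two common definitions of R-squared
rsq_cor &lt;- function(truth, estimate) cor(truth, estimate)^2
rsq_trad &lt;- function(truth, estimate) {
  1 - sum((truth - estimate)^2) / sum((truth - mean(truth))^2)
}
rmse &lt;- function(truth, estimate) sqrt(mean((truth - estimate)^2))

truth  &lt;- c(1, 2, 3, 4)          # toy data, for illustration only
good   &lt;- c(1.1, 1.9, 3.2, 3.8)  # predictions close to the truth
biased &lt;- truth * 2              # perfectly correlated but badly scaled

# Accurate predictions: both definitions agree closely
c(rsq_cor(truth, good), rsq_trad(truth, good), rmse(truth, good))

# Biased predictions: squared correlation stays at 1, traditional R-squared is negative
c(rsq_cor(truth, biased), rsq_trad(truth, biased), rmse(truth, biased))</code></pre></div>
</div>
<p>Because the correlation-based version ignores systematic bias and scale errors, it helps to read it together with RMSE or MAE, as the <code>metrics()</code> output does.</p>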
</section>
<section id="cross-validation-for-predicted-r²" class="level2" data-number="5.5">
<h2 data-number="5.5" class="anchored" data-anchor-id="cross-validation-for-predicted-r²"><span class="header-section-number">5.5</span> Cross-Validation for Predicted R²</h2>
<p>To get a more robust performance estimate, we use 10-fold cross-validation:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb14-2">cv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vfold_cv</span>(train, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">v =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb14-3">resample <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit_resamples</span>(</span>
<span id="cb14-4">  workflow,</span>
<span id="cb14-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">resamples =</span> cv,</span>
<span id="cb14-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">metrics =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">metric_set</span>(rsq, rmse),</span>
<span id="cb14-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">control =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">control_resamples</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">save_pred =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb14-8">)</span>
<span id="cb14-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">collect_metrics</span>(resample)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  &lt;chr&gt;   &lt;chr&gt;      &lt;dbl&gt; &lt;int&gt;   &lt;dbl&gt; &lt;chr&gt;               
1 rmse    standard   4.79     10  0.384  Preprocessor1_Model1
2 rsq     standard   0.712    10  0.0341 Preprocessor1_Model1</code></pre>
</div>
</div>
<p><strong>✅ Interpretation:</strong></p>
<ul>
<li><strong>Predicted R² (via CV)</strong> tells us how well the model would perform on unseen data across multiple resamples.</li>
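<li>It is usually a little lower than the training R², because every fold is scored on data the model did not see. Here the cross-validated R² (0.712) sits just below the training R² (0.726).</li>
<li>Consistency between cross-validated and test R² implies a stable model. The test R² (0.784) is somewhat higher here, a gap that a single, fairly small test split can easily produce.</li>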
</ul>
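<p>To make the mechanics of a cross-validated R² concrete, here is a minimal base-R sketch of 10-fold cross-validation. It rests on two assumptions: <code>MASS::Boston</code> stands in for the <code>boston</code> data used in this post, and the folds are drawn from all rows rather than from the training split, so the numbers will not match the table above.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Assumption: MASS::Boston stands in for the post's `boston` data
boston_df &lt;- MASS::Boston

set.seed(42)
k &lt;- 10
# Randomly assign each row to one of k folds of near-equal size
folds &lt;- sample(rep(seq_len(k), length.out = nrow(boston_df)))

cv_rsq &lt;- vapply(seq_len(k), function(i) {
  train_i &lt;- boston_df[folds != i, ]
  test_i  &lt;- boston_df[folds == i, ]
  fit_i   &lt;- lm(medv ~ ., data = train_i)
  pred_i  &lt;- predict(fit_i, newdata = test_i)
  cor(test_i$medv, pred_i)^2  # squared correlation on the held-out fold
}, numeric(1))

mean(cv_rsq)          # cross-validated ("predicted") R-squared
sd(cv_rsq) / sqrt(k)  # standard error across folds</code></pre></div>
</div>
<p>The spread of the per-fold values is as informative as their mean: a wide spread means any single train/test split can look better or worse than the model really is.</p>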
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>Use cross-validation as a standard evaluation tool, especially when data is limited.</p>
</div>
</div>
<p><strong>💬 Summary of Findings:</strong></p>
<ul>
<li>Our linear model explains a good portion of the variance, but some predictors might be irrelevant or redundant.</li>
<li>Cross-validation confirms the model is relatively stable but leaves room for refinement — possibly through feature selection or nonlinear modeling.</li>
</ul>
<p>In the next step, we can analyze residuals or explore model improvements such as polynomial terms or regularization.</p>
</section>
<section id="residual-diagnostics" class="level2" data-number="5.6">
<h2 data-number="5.6" class="anchored" data-anchor-id="residual-diagnostics"><span class="header-section-number">5.6</span> Residual Diagnostics</h2>
<p>Let’s now check if our linear model satisfies basic regression assumptions. We’ll plot residuals and assess patterns, non-linearity, and potential heteroskedasticity.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(broom)</span>
<span id="cb16-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggthemes)</span>
<span id="cb16-3"></span>
<span id="cb16-4">aug <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">augment</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">extract_fit_engine</span>(fit))</span>
<span id="cb16-5"></span>
<span id="cb16-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(aug, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(.fitted, .resid)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb16-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#2c7fb8"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb16-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_hline</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">yintercept =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linetype =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dashed"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb16-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(</span>
<span id="cb16-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Residuals vs Fitted Values"</span>,</span>
<span id="cb16-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fitted Values"</span>,</span>
<span id="cb16-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Residuals"</span></span>
<span id="cb16-13">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb16-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>()</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2025-04-30_rsquared/index_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid figure-img" width="960"></p>
</figure>
</div>
</div>
</div>
<p><strong>📌 Interpretation:</strong></p>
<ul>
<li>We want residuals to be randomly scattered around zero.</li>
<li>If there’s a pattern or funnel shape, that may indicate <strong>non-linearity</strong> or <strong>heteroskedasticity</strong>.</li>
</ul>
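<p>A visual check can be backed up with a number. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R: regress the squared residuals on the predictors and compare n·R² with a chi-squared distribution. It assumes <code>MASS::Boston</code> as a stand-in for the <code>boston</code> data and fits on all rows, so treat the result as illustrative rather than as a diagnostic of the workflow above.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Assumption: MASS::Boston stands in for the post's `boston` data
boston_df &lt;- MASS::Boston
lm_fit &lt;- lm(medv ~ ., data = boston_df)

# Auxiliary regression: squared residuals on the same predictors
aux_data &lt;- boston_df[, setdiff(names(boston_df), "medv")]
aux_data$res_sq &lt;- resid(lm_fit)^2
aux_fit &lt;- lm(res_sq ~ ., data = aux_data)

# Test statistic: n * R-squared of the auxiliary fit, chi-squared with p df
n_obs   &lt;- nrow(aux_data)
n_pred  &lt;- ncol(aux_data) - 1  # predictors only, excluding res_sq
bp_stat &lt;- n_obs * summary(aux_fit)$r.squared
p_value &lt;- pchisq(bp_stat, df = n_pred, lower.tail = FALSE)

c(statistic = bp_stat, p.value = p_value)</code></pre></div>
</div>
<p>A small p-value points toward non-constant residual variance, which is the funnel shape to look for in the plot.</p>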
</section>
<section id="improving-the-model-transforming-lstat" class="level2" data-number="5.7">
<h2 data-number="5.7" class="anchored" data-anchor-id="improving-the-model-transforming-lstat"><span class="header-section-number">5.7</span> Improving the Model: Transforming <code>lstat</code></h2>
<p>From our earlier EDA, we saw a strong <strong>nonlinear relationship</strong> between <code>lstat</code> (lower status %) and <code>medv</code>. Let’s try <strong>log-transforming</strong> <code>lstat</code> to capture that curvature.</p>
<section id="updated-recipe-with-transformation" class="level3" data-number="5.7.1">
<h3 data-number="5.7.1" class="anchored" data-anchor-id="updated-recipe-with-transformation"><span class="header-section-number">5.7.1</span> Updated Recipe with Transformation</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1">rec_log <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">recipe</span>(medv <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> ., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">step_log</span>(lstat)</span>
<span id="cb17-3"></span>
<span id="cb17-4">workflow_log <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">workflow</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_model</span>(model) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb17-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">add_recipe</span>(rec_log)</span>
<span id="cb17-7"></span>
<span id="cb17-8">fit_log <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit</span>(workflow_log, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> train)</span></code></pre></div></div>
</div>
</section>
<section id="evaluation-of-transformed-model" class="level3" data-number="5.7.2">
<h3 data-number="5.7.2" class="anchored" data-anchor-id="evaluation-of-transformed-model"><span class="header-section-number">5.7.2</span> Evaluation of Transformed Model</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1">preds_log <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(fit_log, test) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_cols</span>(test)</span>
<span id="cb18-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">metrics</span>(preds_log, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">truth =</span> medv, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">estimate =</span> .pred)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 3 × 3
  .metric .estimator .estimate
  &lt;chr&gt;   &lt;chr&gt;          &lt;dbl&gt;
1 rmse    standard       4.43 
2 rsq     standard       0.815
3 mae     standard       3.16 </code></pre>
</div>
</div>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb20-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glance</span>(fit_log)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      &lt;dbl&gt;         &lt;dbl&gt; &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1     0.785         0.778  4.21      110. 2.64e-121    13 -1147. 2324. 2384.
# ℹ 3 more variables: deviance &lt;dbl&gt;, df.residual &lt;int&gt;, nobs &lt;int&gt;</code></pre>
</div>
</div>
<p><strong>🧠 Interpretation:</strong></p>
<ul>
<li>Compared with the original model, test RMSE drops from 4.79 to 4.43 and test R² rises from 0.784 to 0.815.</li>
<li>This improvement indicates that the log transformation captures curvature in the <code>lstat</code> relationship that the untransformed term missed.</li>
<li><strong>Adjusted R²</strong> on the training set also rises, from 0.717 to 0.778, with the same number of predictors, so the gain is not an artifact of added complexity.</li>
</ul>
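<p>The same comparison can be sketched in base R with information criteria alongside Adjusted R². This is an illustration under assumptions: <code>MASS::Boston</code> stands in for the <code>boston</code> data, and both models are fit on all rows instead of the training split, so the values will differ from the output above.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Assumption: MASS::Boston stands in for the post's `boston` data
boston_df &lt;- MASS::Boston

fit_raw &lt;- lm(medv ~ ., data = boston_df)
# Swap lstat for log(lstat); the number of predictors stays the same
fit_logged &lt;- lm(medv ~ . - lstat + log(lstat), data = boston_df)

comparison &lt;- data.frame(
  model         = c("raw lstat", "log(lstat)"),
  adj.r.squared = c(summary(fit_raw)$adj.r.squared,
                    summary(fit_logged)$adj.r.squared),
  AIC           = c(AIC(fit_raw), AIC(fit_logged))
)
comparison</code></pre></div>
</div>
<p>Because both models have the same number of coefficients, a higher Adjusted R² together with a lower AIC would point to a better functional form and not to extra flexibility.</p>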
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>Transformations, polynomial terms, and splines are all valid strategies to improve linear models without abandoning interpretability.</p>
</div>
</div>
<p>With residuals checked and a transformation tested, our next step could be to explore <strong>regularized models</strong> like ridge or lasso regression, or even move beyond linearity with <strong>tree-based</strong> models.</p>
</section>
</section>
</section>
<section id="common-pitfalls-and-misconceptions" class="level1" data-number="6">
<h1 data-number="6"><span class="header-section-number">6</span> Common Pitfalls and Misconceptions</h1>
<p>Even though R² is widely reported and intuitively appealing, its interpretation is often flawed — even by experienced analysts. Here, we’ll go beyond textbook definitions and highlight real-world traps and misunderstandings related to R² and its variants.</p>
<p><strong>🚫 Misconception 1: High R² means the model is good</strong></p>
<ul>
<li>A model with R² = 0.95 may <strong>look impressive</strong>, but that doesn’t guarantee predictive power.</li>
<li>High R² can result from overfitting, especially when the model is complex or contains many predictors.</li>
<li><strong>Adjusted R² and Predicted R²</strong> must be considered to evaluate true usefulness.</li>
</ul>
<p><strong>⚠️ Misconception 2: Adding predictors always improves the model</strong></p>
<ul>
<li>While R² never decreases when a variable is added to a least-squares fit, <strong>Adjusted R² can</strong>, and it will whenever the new variable adds too little explanatory value to offset the penalty.</li>
<li>Including irrelevant predictors increases complexity without improving explanatory power.</li>
<li>This is a form of <strong>dimensional overfitting</strong>.</li>
</ul>
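<p>This behaviour is easy to reproduce. The sketch below adds a column of pure random noise to the data and refits the model; it assumes <code>MASS::Boston</code> as a stand-in for the <code>boston</code> data, and the <code>noise</code> column is invented for the demonstration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Assumption: MASS::Boston stands in for the post's `boston` data
boston_df &lt;- MASS::Boston

set.seed(42)
boston_df$noise &lt;- rnorm(nrow(boston_df))  # predictor unrelated to medv

fit_base  &lt;- lm(medv ~ . - noise, data = boston_df)
fit_noise &lt;- lm(medv ~ ., data = boston_df)

data.frame(
  model         = c("base", "base + noise"),
  r.squared     = c(summary(fit_base)$r.squared,
                    summary(fit_noise)$r.squared),
  adj.r.squared = c(summary(fit_base)$adj.r.squared,
                    summary(fit_noise)$adj.r.squared)
)</code></pre></div>
</div>
<p>R² for the larger model can only match or exceed the base model, even though the extra column carries no information. Whether Adjusted R² drops depends on the particular random draw, so it is worth rerunning with a few different seeds.</p>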
<p><strong>❌ Misconception 3: R² indicates causality</strong></p>
<ul>
<li>R² measures how much variance the model accounts for; it describes association, <strong>not causation</strong>.</li>
<li>A high R² can arise from spurious relationships or confounding variables.</li>
<li>Always supplement with <strong>domain knowledge</strong> and causal reasoning.</li>
</ul>
<p><strong>📉 Misconception 4: R² is a universal performance metric</strong></p>
<ul>
<li>R² only applies to <strong>regression tasks</strong>. Using it for classification models is inappropriate and meaningless.</li>
<li>For binary classification, use metrics like <strong>AUC</strong>, <strong>accuracy</strong>, <strong>precision</strong>, and <strong>recall</strong>.</li>
</ul>
<p><strong>🔍 Misconception 5: Residual plots don’t matter if R² is high</strong></p>
<ul>
<li>A good R² doesn’t guarantee that model assumptions are met.</li>
<li>Residual patterns may still reveal <strong>non-linearity</strong>, <strong>heteroskedasticity</strong>, or <strong>influential outliers</strong>.</li>
<li>Always inspect residual diagnostics.</li>
</ul>
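<p>Anscombe’s quartet, shipped with R as <code>anscombe</code>, is a compact illustration. This is a sketch and not part of the original analysis: the four x/y pairs give nearly the same R², yet their residual plots look nothing alike.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Fit y1 ~ x1, ..., y4 ~ x4 on the built-in anscombe data
fits &lt;- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# R-squared is almost identical across the four fits
r2 &lt;- vapply(fits, function(fit) summary(fit)$r.squared, numeric(1))
print(round(r2, 3))

# Residuals vs fitted values tell four different stories
op &lt;- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(fitted(fits[[i]]), resid(fits[[i]]),
       xlab = "Fitted", ylab = "Residual", main = paste("Dataset", i))
  abline(h = 0, lty = 2)
}
par(op)</code></pre></div>
</div>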
<p><strong>💡 Misconception 6: Predicted R² isn’t necessary</strong></p>
<ul>
<li>Many practitioners report R² and Adjusted R², but <strong>omit cross-validation entirely</strong>.</li>
<li><strong>Predicted R²</strong> (e.g., via 10-fold CV) is the <strong>most honest measure</strong> of model generalizability.</li>
</ul>
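<p>One way to compute a cross-validated R² by hand, sketched in base R on <code>mtcars</code>. The data, the formula, and the definition 1 - PRESS/TSS are assumptions for illustration; other tools may define their R² metrics differently.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">set.seed(123)

k &lt;- 10
n &lt;- nrow(mtcars)
fold &lt;- sample(rep(seq_len(k), length.out = n))  # random fold labels
pred &lt;- numeric(n)

for (i in seq_len(k)) {
  held_out &lt;- fold == i
  # Fit on the other folds, predict the held-out rows
  fit &lt;- lm(mpg ~ wt + hp, data = mtcars[!held_out, ])
  pred[held_out] &lt;- predict(fit, newdata = mtcars[held_out, ])
}

press &lt;- sum((mtcars$mpg - pred)^2)              # out-of-fold squared error
tss   &lt;- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares

r2_in_sample &lt;- summary(lm(mpg ~ wt + hp, data = mtcars))$r.squared
r2_cv &lt;- 1 - press / tss

print(c(in_sample = r2_in_sample, cross_validated = r2_cv))</code></pre></div>
</div>
<p>The gap between the two numbers is a rough indication of how optimistic the in-sample R² is for this model.</p>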
<p><strong>🔬 Misconception 7: R² has a fixed interpretation</strong></p>
<ul>
<li>R² values <strong>depend on the context</strong>. In social sciences, an R² of 0.3 can be meaningful, while in physics we expect 0.99+.</li>
<li>A “low” R² doesn’t mean the model is useless — it may reflect inherent variability in human behavior or macroeconomic data.</li>
</ul>
<hr>
<blockquote class="blockquote">
<p><strong>Insight:</strong> Always use R² in context — alongside other metrics, validation strategies, and graphical checks.</p>
</blockquote>
<p>For a deeper dive into R² misconceptions and proper regression diagnostics, see:</p>
<ul>
<li><p>Harrell, F. (2015). <em>Regression Modeling Strategies</em>. Springer.</p></li>
<li><p>Gelman &amp; Hill (2006). <em>Data Analysis Using Regression and Multilevel/Hierarchical Models</em>.</p></li>
<li><p>Burnham &amp; Anderson (2002). <em>Model Selection and Multimodel Inference</em>.</p></li>
<li><p>Kutner et al.&nbsp;(2004). <em>Applied Linear Regression Models</em>.</p></li>
</ul>
<p>Together, these references build the foundation for <strong>responsible model interpretation</strong>.</p>
</section>
<section id="conclusion-recommendations" class="level1" data-number="7">
<h1 data-number="7"><span class="header-section-number">7</span> Conclusion &amp; Recommendations</h1>
<section id="summary" class="level2" data-number="7.1">
<h2 data-number="7.1" class="anchored" data-anchor-id="summary"><span class="header-section-number">7.1</span> 📌 Summary</h2>
<p>In this post, we explored <strong>R²</strong>, <strong>Adjusted R²</strong>, and <strong>Predicted R²</strong> in depth — not just as mathematical constructs, but as tools for critical thinking in modeling. We walked through theory, practical application in R with tidymodels, residual diagnostics, and even model improvement through transformation.</p>
<p>Let’s recap:</p>
<ul>
<li><strong>R²</strong> tells us how well our model fits the training data, but can be misleading on its own.</li>
<li><strong>Adjusted R²</strong> improves upon R² by accounting for model complexity.</li>
<li><strong>Predicted R²</strong>, evaluated via cross-validation, provides the most trustworthy estimate of real-world performance.</li>
</ul>
<p>High R² values can be seductive. But as we saw, <strong>they don’t guarantee causality, generalizability, or correctness</strong>. Only by combining R² with residual diagnostics, domain knowledge, and out-of-sample validation can we judge a model responsibly.</p>
</section>
<section id="recommendations-for-practitioners" class="level2" data-number="7.2">
<h2 data-number="7.2" class="anchored" data-anchor-id="recommendations-for-practitioners"><span class="header-section-number">7.2</span> 💡 Recommendations for Practitioners</h2>
<ol type="1">
<li><strong>Always accompany R² with Adjusted and Predicted R²</strong> — never rely on one metric alone.</li>
<li><strong>Perform residual diagnostics</strong> to check linearity, variance assumptions, and outlier influence.</li>
<li><strong>Use cross-validation (e.g., 10-fold)</strong> as a default evaluation strategy, especially when the dataset is not large.</li>
<li><strong>Transform nonlinear predictors</strong> (as we did with <code>lstat</code>) or use flexible models (e.g., splines, GAMs) when needed.</li>
<li><strong>Avoid including irrelevant predictors</strong> — they inflate R² without improving generalization.</li>
<li><strong>Contextualize your R²</strong> — in some fields, a lower R² is still useful; in others, it may signal inadequacy.</li>
<li><strong>Complement numerical metrics with visual tools</strong> — scatterplots, predicted vs.&nbsp;actual plots, and residuals reveal insights numbers alone may miss.</li>
</ol>
</section>
<section id="looking-ahead" class="level2" data-number="7.3">
<h2 data-number="7.3" class="anchored" data-anchor-id="looking-ahead"><span class="header-section-number">7.3</span> 🚀 Looking Ahead</h2>
<p>If you want to take your modeling further:</p>
<ul>
<li>Try <strong>ridge or lasso regression</strong> to handle multicollinearity.</li>
<li>Explore <strong>tree-based models</strong> (e.g., random forests) when relationships are complex and nonlinear.</li>
<li>Use tools like <strong><code>yardstick</code></strong> and <strong><code>modeltime</code></strong> to automate robust validation and reporting.</li>
</ul>
<blockquote class="blockquote">
<p>In the end, modeling isn’t just about maximizing R² — it’s about <strong>understanding your data, validating your decisions</strong>, and making <strong>informed predictions</strong>.</p>
</blockquote>
<p>Thanks for reading!</p>
<p>Feel free to share, fork, or reuse this analysis. Questions and comments are welcome.</p>



</section>
</section>

 ]]></description>
  <category>R</category>
  <category>Statistics</category>
  <category>Machine Learning</category>
  <category>r-squared</category>
  <category>adjusted-r-squared</category>
  <category>predictive-modeling</category>
  <category>tidymodels</category>
  <category>model-evaluation</category>
  <guid>https://mfatihtuzen.github.io/posts/2025-04-30_rsquared/</guid>
  <pubDate>Wed, 30 Apr 2025 00:00:00 GMT</pubDate>
  <media:content url="https://mfatihtuzen.github.io/posts/2025-04-30_rsquared/quote_box.png" medium="image" type="image/png" height="216" width="144"/>
</item>
<item>
  <title>Underrated Gems in R: Must-Know Functions You’re Probably Missing Out On</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2025-03-11_underrated_functions/</link>
  <description><![CDATA[ 






<p>R is packed with powerhouse tools—think dplyr for data wrangling, ggplot2 for stunning visuals, or tidyr for tidying up messes. But beyond the headliners, there’s a lineup of lesser-known functions that deserve a spot in your toolkit. These hidden gems can streamline your code, solve tricky problems, and even make you wonder how you managed without them. In this post, we’ll uncover four underrated R functions: <strong><code>Reduce, vapply, do.call</code></strong> and <strong><code>janitor::clean_names</code></strong>. With practical examples ranging from beginner-friendly to advanced, plus outputs to show you what’s possible, this guide will have you itching to try them out in your next project. Let’s dive in and see what these under-the-radar stars can do!</p>
<section id="reduce-collapse-with-control" class="level2">
<h2 class="anchored" data-anchor-id="reduce-collapse-with-control">1. Reduce: Collapse with Control</h2>
<section id="what-it-does-and-its-arguments" class="level3">
<h3 class="anchored" data-anchor-id="what-it-does-and-its-arguments">What It Does and Its Arguments</h3>
<p>Reduce is a base R function that iteratively applies a two-argument function to a list or vector, shrinking it down to a single result. It’s like a secret weapon for avoiding loops while keeping things elegant.</p>
<p><strong>Key Arguments:</strong></p>
<ul>
<li><p><code>f</code>: The function to apply (e.g., <code>+</code>, <code>*</code>, or a custom one).</p></li>
<li><p><code>x</code>: The list or vector to reduce.</p></li>
<li><p><code>init</code> (optional): A starting value (defaults to the first element of x if omitted).</p></li>
<li><p><code>accumulate</code> (optional): If TRUE, returns all intermediate results (defaults to FALSE).</p></li>
</ul>
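<p>A quick sketch of how these arguments change the result. It also uses <code>right</code>, a further base R argument not listed above, to show the fold direction.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># accumulate = TRUE keeps every intermediate result (a running total)
running &lt;- Reduce(`+`, 1:5, accumulate = TRUE)

# init is used as the starting value, so the result has one extra element
from_100 &lt;- Reduce(`+`, 1:5, init = 100, accumulate = TRUE)

# right = TRUE folds from the last element towards the first
from_right &lt;- Reduce(`+`, 1:5, accumulate = TRUE, right = TRUE)

# With an empty input, init is returned unchanged
empty &lt;- Reduce(`+`, list(), 0)

print(running)     # 1 3 6 10 15
print(from_100)    # 100 101 103 106 110 115
print(from_right)  # 15 14 12 9 5
print(empty)       # 0</code></pre></div>
</div>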
</section>
<section id="use-cases" class="level3">
<h3 class="anchored" data-anchor-id="use-cases">Use Cases</h3>
<ul>
<li><p>Summing or multiplying without explicit iteration.</p></li>
<li><p>Combining data structures step-by-step.</p></li>
<li><p>Simplifying recursive tasks.</p></li>
</ul>
</section>
<section id="examples" class="level3">
<h3 class="anchored" data-anchor-id="examples">Examples</h3>
<section id="simple-quick-sum" class="level4">
<h4 class="anchored" data-anchor-id="simple-quick-sum">Simple: Quick Sum</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">numbers <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb1-2">total <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Reduce</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">+</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span>, numbers)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(total)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 15</code></pre>
</div>
</div>
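<p><em><strong>Explanation</strong>:</em> Reduce adds 1 + 2 = 3, then 3 + 3 = 6, 6 + 4 = 10, and 10 + 5 = 15. The result matches <code>sum(numbers)</code>; the example is here to show how Reduce folds a two-argument function over a vector, one element at a time.</p>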
</section>
<section id="intermediate-string-building" class="level4">
<h4 class="anchored" data-anchor-id="intermediate-string-building">Intermediate: String Building</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">words <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"R"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"is"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"awesome"</span>)</span>
<span id="cb3-2">sentence <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Reduce</span>(paste, words, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">init =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)</span>
<span id="cb3-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(sentence)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] " R is awesome"</code></pre>
</div>
</div>
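<p><em><strong>Explanation</strong>:</em> Starting from an empty string (<code>init = ""</code>), Reduce pastes the words together with spaces. Because <code>paste("", "R")</code> returns <code>" R"</code>, the result carries a leading space. Skip <code>init</code>, and Reduce starts with <code>"R"</code>, giving <code>"R is awesome"</code>.</p>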
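<p>A quick check of the two variants, with <code>paste(collapse = )</code> shown alongside as the usual base R shortcut for this particular task:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">words &lt;- c("R", "is", "awesome")

with_init    &lt;- Reduce(paste, words, "")    # starts from "", so a space comes first
without_init &lt;- Reduce(paste, words)        # starts from the first word
collapsed    &lt;- paste(words, collapse = " ")

print(with_init)     # " R is awesome"
print(without_init)  # "R is awesome"
print(collapsed)     # "R is awesome"</code></pre></div>
</div>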
</section>
<section id="advanced-merging-data-frames" class="level4">
<h4 class="anchored" data-anchor-id="advanced-merging-data-frames">Advanced: Merging Data Frames</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">df1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">a =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">b =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"x"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"y"</span>))</span>
<span id="cb5-2">df2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">a =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">b =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"z"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"w"</span>))</span>
<span id="cb5-3">df3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">a =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">b =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"p"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"q"</span>))</span>
<span id="cb5-4">combined <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Reduce</span>(rbind, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(df1, df2, df3))</span>
<span id="cb5-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(combined)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>  a b
1 1 x
2 2 y
3 3 z
4 4 w
5 5 p
6 6 q</code></pre>
</div>
</div>
<p><em><strong>Explanation</strong>:</em> Reduce stacks three data frames row-wise, pairing them up one by one. It’s a loop-free way to combine any number of data frames that share the same columns.</p>
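<p>The same pattern works for key-based merges. A small sketch with three hypothetical data frames that share an <code>id</code> column (the names and values are made up for illustration):</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">df_a &lt;- data.frame(id = 1:3, x = c(10, 20, 30))
df_b &lt;- data.frame(id = 1:3, y = c("a", "b", "c"))
df_c &lt;- data.frame(id = 1:3, z = c(TRUE, FALSE, TRUE))

# Merge the accumulated result with the next data frame on "id"
merged &lt;- Reduce(
  function(left, right) merge(left, right, by = "id"),
  list(df_a, df_b, df_c)
)
print(merged)</code></pre></div>
</div>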
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>A Quick Note on purrr::reduce()
</div>
</div>
<div class="callout-body-container callout-body">
<p>If you’re a fan of the tidyverse, check out purrr::reduce(). It’s a modern take on base R’s Reduce, offering a consistent syntax with other purrr functions (like .x and .y for arguments) and handy shortcuts like ~ .x + .y for inline functions. It reduces left-to-right by default and can go right-to-left with <code>.dir = "backward"</code>; the older reduce_right() is deprecated. Worth a look if you want a more polished, tidyverse-friendly alternative!</p>
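<p>A short sketch of those purrr features, assuming purrr 0.3.0 or later is installed. Subtraction is used only because it makes the direction of the fold visible.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(purrr)

# Formula shorthand: .x is the accumulated value, .y the next element
total &lt;- reduce(1:5, ~ .x + .y)

# Left fold: (1 - 2) - 3
left &lt;- reduce(c(1, 2, 3), function(acc, x) acc - x)

# Right fold with .dir = "backward": 1 - (2 - 3)
right &lt;- reduce(c(1, 2, 3), function(x, acc) x - acc, .dir = "backward")

print(c(total = total, left = left, right = right))</code></pre></div>
</div>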
<p>Here’s an intermediate-level example of using the <code>reduce()</code> function from the <code>purrr</code> package for joining multiple dataframes:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(purrr)</span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb7-3"></span>
<span id="cb7-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create three sample dataframes representing different aspects of customer data</span></span>
<span id="cb7-5">customers <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb7-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">customer_id =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,</span>
<span id="cb7-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">name =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Alice"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bob"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Charlie"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Diana"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Edward"</span>),</span>
<span id="cb7-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">45</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">28</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">36</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">52</span>)</span>
<span id="cb7-9">)</span>
<span id="cb7-10"></span>
<span id="cb7-11">orders <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb7-12">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">order_id =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">101</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">108</span>,</span>
<span id="cb7-13">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">customer_id =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>),</span>
<span id="cb7-14">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">order_date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.Date</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-15"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-20"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-02-10"</span>, </span>
<span id="cb7-15">                        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-05"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-02-15"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-03-20"</span>,</span>
<span id="cb7-16">                        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-02-25"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-03-10"</span>)),</span>
<span id="cb7-17">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">amount =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">120.50</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">85.75</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">200.00</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">45.99</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">75.25</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">150.00</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">95.50</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">210.25</span>)</span>
<span id="cb7-18">)</span>
<span id="cb7-19"></span>
<span id="cb7-20">feedback <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb7-21">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">feedback_id =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">201</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">206</span>,</span>
<span id="cb7-22">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">customer_id =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>),</span>
<span id="cb7-23">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rating =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>),</span>
<span id="cb7-24">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">feedback_date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.Date</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-20"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-25"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-10"</span>,</span>
<span id="cb7-25">                          <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-02-20"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-03-01"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-03-15"</span>))</span>
<span id="cb7-26">)</span>
<span id="cb7-27"></span>
<span id="cb7-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># List of dataframes to join with the joining column</span></span>
<span id="cb7-29">dataframes_to_join <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(</span>
<span id="cb7-30">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> customers, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"customer_id"</span>),</span>
<span id="cb7-31">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> orders, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"customer_id"</span>),</span>
<span id="cb7-32">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> feedback, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"customer_id"</span>)</span>
<span id="cb7-33">)</span>
<span id="cb7-34"></span>
<span id="cb7-35"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Using reduce to join all dataframes</span></span>
<span id="cb7-36"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Start with customers dataframe and progressively join the others</span></span>
<span id="cb7-37">joined_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reduce</span>(</span>
<span id="cb7-38">  dataframes_to_join[<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Exclude first dataframe as it's our starting point</span></span>
<span id="cb7-39">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(acc, x) {</span>
<span id="cb7-40">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(acc, x<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>df, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> x<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>by)</span>
<span id="cb7-41">  },</span>
<span id="cb7-42">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.init =</span> dataframes_to_join[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>df  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Start with customers dataframe</span></span>
<span id="cb7-43">)</span>
<span id="cb7-44"></span>
<span id="cb7-45"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View the result</span></span>
<span id="cb7-46"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(joined_data)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>   customer_id    name age order_id order_date amount feedback_id rating
1            1   Alice  32      101 2023-01-15 120.50         201      4
2            2     Bob  45      102 2023-01-20  85.75         202      5
3            2     Bob  45      103 2023-02-10 200.00         202      5
4            3 Charlie  28      104 2023-01-05  45.99         203      3
5            3 Charlie  28      104 2023-01-05  45.99         204      4
6            3 Charlie  28      105 2023-02-15  75.25         203      3
7            3 Charlie  28      105 2023-02-15  75.25         204      4
8            3 Charlie  28      106 2023-03-20 150.00         203      3
9            3 Charlie  28      106 2023-03-20 150.00         204      4
10           4   Diana  36      107 2023-02-25  95.50         205      5
11           5  Edward  52      108 2023-03-10 210.25         206      4
   feedback_date
1     2023-01-20
2     2023-01-25
3     2023-01-25
4     2023-01-10
5     2023-02-20
6     2023-01-10
7     2023-02-20
8     2023-01-10
9     2023-02-20
10    2023-03-01
11    2023-03-15</code></pre>
</div>
</div>
<p>This example demonstrates how to use <code>reduce()</code> to join multiple dataframes in a sequential, elegant way. This pattern is particularly useful when dealing with complex data integration tasks where you need to combine multiple data sources with a common identifier.</p>
</div>
</div>
</section>
</section>
</section>
<section id="vapply-iteration-with-assurance" class="level2">
<h2 class="anchored" data-anchor-id="vapply-iteration-with-assurance">2. vapply: Iteration with Assurance</h2>
<section id="what-it-does-and-its-arguments-1" class="level3">
<h3 class="anchored" data-anchor-id="what-it-does-and-its-arguments-1">What It Does and Its Arguments</h3>
<p>vapply is another base R gem, similar to sapply but with a twist: it makes you declare the type and length of each result upfront, and it stops with an error if any result does not match. This makes it safer and more predictable, especially for critical tasks.</p>
<p><strong>Key Arguments:</strong></p>
<ul>
<li><p><code>X</code>: The list or vector to process.</p></li>
<li><p><code>FUN</code>: The function to apply to each element.</p></li>
<li><p><code>FUN.VALUE</code>: A template for the value each call to <code>FUN</code> must return (e.g., numeric(1) for a single number).</p></li>
</ul>
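<p>As a minimal sketch of that safety check, the snippet below uses a hypothetical list, <code>mixed</code>, whose last element is a string. The variable names and the message returned by the error handler are illustrative assumptions, not part of the original examples.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical input: the third element is a string, not a number
mixed &lt;- list(1, 2, "three")

# The third result is a character value, which does not match numeric(1),
# so vapply() raises an error; tryCatch() turns it into a message here
result &lt;- tryCatch(
  vapply(mixed, function(x) x, numeric(1)),
  error = function(e) "vapply stopped: a result did not match FUN.VALUE"
)
print(result)

# sapply() accepts the same input and silently coerces everything to character
coerced &lt;- sapply(mixed, function(x) x)
print(class(coerced))</code></pre></div>
</div>
<p>The point of the comparison is that vapply fails loudly at the first mismatch, whereas sapply quietly changes the type of the whole result.</p>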
</section>
<section id="use-cases-1" class="level3">
<h3 class="anchored" data-anchor-id="use-cases-1">Use Cases</h3>
<ul>
<li><p>Guaranteeing consistent output types.</p></li>
<li><p>Extracting specific stats from lists.</p></li>
<li><p>Writing reliable code for packages or production.</p></li>
</ul>
</section>
<section id="examples-1" class="level3">
<h3 class="anchored" data-anchor-id="examples-1">Examples</h3>
<section id="simple-doubling-up" class="level4">
<h4 class="anchored" data-anchor-id="simple-doubling-up">Simple: Doubling Up</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">values <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span></span>
<span id="cb9-2">doubled <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vapply</span>(values, <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">numeric</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb9-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(doubled)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2 4 6</code></pre>
</div>
</div>
<p><em><strong>Explanation</strong>:</em> Each value is doubled, and numeric(1) requires every result to be a single number, so the output is guaranteed to be a numeric vector.</p>
</section>
<section id="intermediate-word-lengths" class="level4">
<h4 class="anchored" data-anchor-id="intermediate-word-lengths">Intermediate: Word Lengths</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1">terms <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"science"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"R"</span>)</span>
<span id="cb11-2">word_lengths <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vapply</span>(terms, nchar, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">numeric</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb11-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(word_lengths)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>   data science       R 
      4       7       1 </code></pre>
</div>
</div>
<p><em><strong>Explanation</strong>:</em> vapply counts the characters in each word and always returns a numeric vector, named after the input words. sapply, by contrast, can change its return type depending on the input.</p>
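<p>One case where the two functions differ is empty input. The sketch below assumes a hypothetical empty character vector, <code>no_terms</code>, such as you might get after filtering out every word.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical empty input
no_terms &lt;- character(0)

# sapply() has nothing to simplify, so it falls back to an empty list
from_sapply &lt;- sapply(no_terms, nchar)
print(class(from_sapply))

# vapply() still honours the numeric(1) template and returns an empty numeric vector
from_vapply &lt;- vapply(no_terms, nchar, numeric(1))
print(class(from_vapply))</code></pre></div>
</div>
<p>Code that later calls <code>sum()</code> or <code>mean()</code> on the result keeps working with the vapply version, while the list returned by sapply would need special handling.</p>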
</section>
<section id="advanced-stats-snapshot" class="level4">
<h4 class="anchored" data-anchor-id="advanced-stats-snapshot">Advanced: Stats Snapshot</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1">samples <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>))</span>
<span id="cb13-2">stats <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">vapply</span>(samples, <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(x)), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">numeric</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb13-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(stats)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>     [,1]      [,2] [,3]
mean    2 4.5000000    7
sd      1 0.7071068    1</code></pre>
</div>
</div>
<p><em><strong>Explanation</strong>:</em> For each sample, vapply computes mean and standard deviation, returning a matrix (2 rows, 3 columns). It’s a tidy, type-safe summary.</p>
</section>
</section>
</section>
<section id="do.call-dynamic-function-magic" class="level2">
<h2 class="anchored" data-anchor-id="do.call-dynamic-function-magic">3. do.call: Dynamic Function Magic</h2>
<section id="what-it-does-and-its-arguments-2" class="level3">
<h3 class="anchored" data-anchor-id="what-it-does-and-its-arguments-2">What It Does and Its Arguments</h3>
<p>do.call in base R lets you call a function with a list of arguments, making it a go-to for flexible, on-the-fly operations. It’s like having a universal remote for your functions.</p>
<p><strong>Key Arguments:</strong></p>
<ul>
<li><p><code>what</code>: The function to call (e.g., rbind, paste).</p></li>
<li><p><code>args</code>: A list of arguments to pass.</p></li>
<li><p><code>quote</code> (optional): Whether to quote the arguments before the call is made; rarely needed, defaults to FALSE.</p></li>
</ul>
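<p>To make the arguments concrete, here is a small sketch. The list <code>arg_list</code> and its values are made up for illustration: unnamed elements are passed by position, named elements by name, and <code>what</code> can be either the function itself or its name as a string.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical argument list: one positional argument and one named argument
arg_list &lt;- list(c(4, 8, NA, 12), na.rm = TRUE)

# Equivalent to mean(c(4, 8, NA, 12), na.rm = TRUE)
avg &lt;- do.call(mean, arg_list)

# 'what' may also be given as a character string
avg_by_name &lt;- do.call("mean", arg_list)

print(avg)          # 8
print(avg_by_name)  # 8</code></pre></div>
</div>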
</section>
<section id="use-cases-2" class="level3">
<h3 class="anchored" data-anchor-id="use-cases-2">Use Cases</h3>
<ul>
<li><p>Combining variable inputs.</p></li>
<li><p>Running functions dynamically.</p></li>
<li><p>Simplifying calls with list-based data.</p></li>
</ul>
</section>
<section id="examples-2" class="level3">
<h3 class="anchored" data-anchor-id="examples-2">Examples</h3>
<section id="simple-vector-mashup" class="level4">
<h4 class="anchored" data-anchor-id="simple-vector-mashup">Simple: Vector Mashup</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1">chunks <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>)</span>
<span id="cb15-2">combined <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">do.call</span>(c, chunks)</span>
<span id="cb15-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(combined)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 1 2 3 4 5 6</code></pre>
</div>
</div>
<p><em><strong>Explanation</strong>:</em> do.call passes each element of the list to c() as a separate argument, so the call is equivalent to c(1:3, 4:6) and the vectors are stitched together.</p>
</section>
<section id="intermediate-custom-join" class="level4">
<h4 class="anchored" data-anchor-id="intermediate-custom-join">Intermediate: Custom Join</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1">bits <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Code"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Runs"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fast"</span>)</span>
<span id="cb17-2">joined <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">do.call</span>(paste, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(bits, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"|"</span>)))</span>
<span id="cb17-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(joined)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Code|Runs|Fast"</code></pre>
</div>
</div>
<p><em><strong>Explanation</strong>:</em> do.call appends a named sep argument to the list, so the call is equivalent to paste("Code", "Runs", "Fast", sep = "|") and produces a pipe-separated string in one move.</p>
</section>
<section id="advanced-flexible-binding" class="level4">
<h4 class="anchored" data-anchor-id="advanced-flexible-binding">Advanced: Flexible Binding</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb19-1">df_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb19-2">direction <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"vertical"</span></span>
<span id="cb19-3">bound <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">do.call</span>(<span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (direction <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"vertical"</span>) rbind <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> cbind, df_list)</span>
<span id="cb19-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(bound)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>  x
1 1
2 2
3 3
4 4</code></pre>
</div>
</div>
<p><em><strong>Explanation</strong>:</em> With direction = “vertical”, do.call uses rbind to stack rows. Change it to “horizontal”, and cbind takes over—dynamic and smart.</p>
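<p>The same idea can be wrapped in a small helper. This is only a sketch: the function name <code>bind_frames</code> and the example data frames are assumptions made for illustration, and <code>match.arg()</code> is added so that a misspelled direction fails instead of silently falling through to cbind.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical helper: choose the binding function from a direction string
bind_frames &lt;- function(frames, direction = c("vertical", "horizontal")) {
  direction &lt;- match.arg(direction)  # rejects anything other than the two options
  binder &lt;- if (direction == "vertical") rbind else cbind
  do.call(binder, frames)
}

# Stack rows: both data frames share the column x
tall &lt;- bind_frames(list(data.frame(x = 1:2), data.frame(x = 3:4)))
print(dim(tall))   # 4 rows, 1 column

# Bind columns: the data frames have different column names
wide &lt;- bind_frames(list(data.frame(x = 1:2), data.frame(y = 3:4)), "horizontal")
print(dim(wide))   # 2 rows, 2 columns</code></pre></div>
</div>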
</section>
</section>
</section>
<section id="janitorclean_names-tame-your-column-chaos" class="level2">
<h2 class="anchored" data-anchor-id="janitorclean_names-tame-your-column-chaos">4. janitor::clean_names: Tame Your Column Chaos</h2>
<section id="what-it-does-and-its-arguments-3" class="level3">
<h3 class="anchored" data-anchor-id="what-it-does-and-its-arguments-3">What It Does and Its Arguments</h3>
<p>From the janitor package, clean_names() transforms messy column names into consistent, code-friendly formats (e.g., lowercase with underscores). It’s a time-saver you’ll wish you’d known sooner.</p>
<p><strong>Key Arguments:</strong></p>
<ul>
<li><p><code>dat</code>: The data frame to clean.</p></li>
<li><p><code>case</code>: The style for names (e.g., “snake”, “small_camel”, defaults to “snake”).</p></li>
<li><p><code>replace</code>: A named vector for custom replacements (optional).</p></li>
</ul>
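<p>As a quick sketch of the <code>case</code> argument, the snippet below cleans a hypothetical one-row data frame, <code>raw</code>, in two styles. It assumes the janitor package is installed, and the column names are invented for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(janitor)

# Hypothetical data frame with messy headers
raw &lt;- data.frame(`First Name` = "Ada", `Total Sales` = 10, check.names = FALSE)

# Default style: snake case
snake_names &lt;- names(clean_names(raw))
print(snake_names)

# Camel case with a lowercase first letter
camel_names &lt;- names(clean_names(raw, case = "small_camel"))
print(camel_names)</code></pre></div>
</div>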
</section>
<section id="use-cases-3" class="level3">
<h3 class="anchored" data-anchor-id="use-cases-3">Use Cases</h3>
<ul>
<li><p>Standardizing imported data with ugly headers.</p></li>
<li><p>Prepping data frames for analysis or plotting.</p></li>
<li><p>Avoiding frustration with inconsistent naming.</p></li>
</ul>
</section>
<section id="examples-3" class="level3">
<h3 class="anchored" data-anchor-id="examples-3">Examples</h3>
<section id="simple-basic-cleanup" class="level4">
<h4 class="anchored" data-anchor-id="simple-basic-cleanup">Simple: Basic Cleanup</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb21-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(janitor)</span>
<span id="cb21-2"></span>
<span id="cb21-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a dataframe with messy column names</span></span>
<span id="cb21-4">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb21-5">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">First Name</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"John"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mary"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"David"</span>),</span>
<span id="cb21-6">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Last.Name</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Smith"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Johnson"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Williams"</span>),</span>
<span id="cb21-7">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Email-Address</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"john@example.com"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mary@example.com"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"david@example.com"</span>),</span>
<span id="cb21-8">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Annual Income ($)</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">65000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">78000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">52000</span>),</span>
<span id="cb21-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">check.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb21-10">)</span>
<span id="cb21-11"></span>
<span id="cb21-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View original column names</span></span>
<span id="cb21-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(df)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "First Name"        "Last.Name"         "Email-Address"    
[4] "Annual Income ($)"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb23-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Clean the names</span></span>
<span id="cb23-2">clean_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">clean_names</span>(df)</span>
<span id="cb23-3"></span>
<span id="cb23-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View cleaned column names</span></span>
<span id="cb23-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(clean_df)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "first_name"    "last_name"     "email_address" "annual_income"</code></pre>
</div>
</div>
<p>What <code>clean_names()</code> specifically does:</p>
<ul>
<li><p>Converts all names to lowercase</p></li>
<li><p>Replaces spaces, periods, and hyphens with underscores</p></li>
<li><p>Drops other special characters, such as the parentheses and dollar sign in the example above</p></li>
<li><p>Creates names that are valid R variable names and follow standard naming conventions</p></li>
</ul>
<p>This standardization makes your data more consistent, easier to work with, and helps prevent errors when manipulating or joining datasets.</p>
</section>
<section id="intermediate-custom-style" class="level4">
<h4 class="anchored" data-anchor-id="intermediate-custom-style">Intermediate: Custom Style</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb25-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb25-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(purrr)</span>
<span id="cb25-3"></span>
<span id="cb25-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create multiple dataframes with inconsistent naming</span></span>
<span id="cb25-5">df1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb25-6">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Customer ID</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb25-7">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">First Name</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"John"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mary"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"David"</span>),</span>
<span id="cb25-8">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">LAST NAME</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Smith"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Johnson"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Williams"</span>),</span>
<span id="cb25-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">check.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb25-10">)</span>
<span id="cb25-11"></span>
<span id="cb25-12">df2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb25-13">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">customer.id</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,</span>
<span id="cb25-14">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">firstName</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Michael"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Linda"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"James"</span>),</span>
<span id="cb25-15">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lastName</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Brown"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Davis"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Miller"</span>),</span>
<span id="cb25-16">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">check.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb25-17">)</span>
<span id="cb25-18"></span>
<span id="cb25-19">df3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb25-20">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cust_id</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>,</span>
<span id="cb25-21">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">first-name</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Robert"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Jennifer"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Thomas"</span>),</span>
<span id="cb25-22">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">last-name</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Wilson"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Martinez"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Anderson"</span>),</span>
<span id="cb25-23">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">check.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb25-24">)</span>
<span id="cb25-25"></span>
<span id="cb25-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># List of dataframes</span></span>
<span id="cb25-27">dfs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(df1, df2, df3)</span>
<span id="cb25-28"></span>
<span id="cb25-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Clean names of all dataframes</span></span>
<span id="cb25-30">clean_dfs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(dfs, clean_names)</span>
<span id="cb25-31"></span>
<span id="cb25-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Print column names for each cleaned dataframe</span></span>
<span id="cb25-33"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(clean_dfs, names)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[[1]]
[1] "customer_id" "first_name"  "last_name"  

[[2]]
[1] "customer_id" "first_name"  "last_name"  

[[3]]
[1] "cust_id"    "first_name" "last_name" </code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb27-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Bind the dataframes (now possible because of standardized column names)</span></span>
<span id="cb27-2">combined_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(clean_dfs)</span>
<span id="cb27-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(combined_df)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>  customer_id first_name last_name cust_id
1           1       John     Smith      NA
2           2       Mary   Johnson      NA
3           3      David  Williams      NA
4           4    Michael     Brown      NA
5           5      Linda     Davis      NA
6           6      James    Miller      NA
7          NA     Robert    Wilson       7
8          NA   Jennifer  Martinez       8
9          NA     Thomas  Anderson       9</code></pre>
</div>
</div>
<p>This example applies <code>clean_names()</code> to several data frames with inconsistent naming conventions. Cleaning standardizes the style of each name, but it does not reconcile different names for the same field: because the third data frame uses <code>cust_id</code> rather than <code>customer_id</code>, <code>bind_rows()</code> keeps both columns and fills the gaps with <code>NA</code>. Standardized naming therefore has to cover the names themselves, not only their formatting.</p>
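<p>One possible way to avoid the extra <code>cust_id</code> column is to harmonize the identifier name before binding. The sketch below is only an illustration: it assumes that <code>cust_id</code> and <code>customer_id</code> really describe the same field, it uses small stand-in data frames rather than the ones above, and the helper name <code>harmonize_id</code> is hypothetical, not part of janitor or dplyr.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Stand-ins for two already-cleaned data frames (illustrative data only)
clean_dfs &lt;- list(
  data.frame(customer_id = 1:3, first_name = c("John", "Mary", "David")),
  data.frame(cust_id = 7:9, first_name = c("Robert", "Jennifer", "Thomas"))
)

# Hypothetical helper: rename cust_id to customer_id when it is present
harmonize_id &lt;- function(df) {
  names(df)[names(df) == "cust_id"] &lt;- "customer_id"
  df
}

# Base R equivalent of binding rows once the names agree
combined_df &lt;- do.call(rbind, lapply(clean_dfs, harmonize_id))
names(combined_df)</code></pre></div>
</div>
<p>With the names aligned, every row should carry its identifier in a single <code>customer_id</code> column. <code>bind_rows()</code> could be used in place of <code>do.call(rbind, ...)</code> in the same way.</p>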
</section>
<section id="advanced-targeted-fixes" class="level4">
<h4 class="anchored" data-anchor-id="advanced-targeted-fixes">Advanced: Targeted Fixes</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb29-1">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ID#"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sales_%"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Q1 Revenue"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>))</span>
<span id="cb29-2">cleaned <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">clean_names</span>(df, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_num"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"%"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_pct"</span>))</span>
<span id="cb29-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(cleaned))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "id"         "sales"      "q1_revenue"</code></pre>
</div>
</div>
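<p><em><strong>Explanation</strong>:</em> The <code>replace</code> argument is meant to swap # for _num and % for _pct before <code>clean_names()</code> handles the rest. Here, however, <code>data.frame()</code> was called without <code>check.names = FALSE</code>, so R had already converted both symbols to dots and the replacements never matched. That is why the output shows <code>id</code> and <code>sales</code>.</p>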
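<p>For the custom replacements to have something to match, the special characters need to survive the creation of the data frame. The following is a minimal sketch under that assumption: it adds <code>check.names = FALSE</code>, which the call above does not use, and otherwise relies only on the <code>replace</code> argument already shown.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(janitor)

# check.names = FALSE keeps "#" and "%" in the column names,
# so the replace argument has something to match
df &lt;- data.frame("ID#" = 1:2, "Sales_%" = c(10, 20), check.names = FALSE)
names(df)

# Custom replacements are applied before the usual snake_case cleaning
cleaned &lt;- clean_names(df, replace = c("#" = "_num", "%" = "_pct"))
names(cleaned)</code></pre></div>
</div>
<p>If the replacements apply as intended, the cleaned names should contain <code>num</code> and <code>pct</code> instead of dropping the symbols.</p>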
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb31-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(readxl)</span>
<span id="cb31-2"></span>
<span id="cb31-3"></span>
<span id="cb31-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a temporary Excel file with problematic column names</span></span>
<span id="cb31-5">temp_file <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tempfile</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fileext =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".xlsx"</span>)</span>
<span id="cb31-6">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb31-7">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ID#</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,</span>
<span id="cb31-8">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">%_Completed</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">85</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">92</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">78</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">65</span>),</span>
<span id="cb31-9">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Result (Pass/Fail)</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fail"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Pass"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fail"</span>),</span>
<span id="cb31-10">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">μg/mL</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>),</span>
<span id="cb31-11">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span><span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">p-value</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">`</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.08</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.002</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.06</span>),</span>
<span id="cb31-12">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">check.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb31-13">)</span>
<span id="cb31-14"></span>
<span id="cb31-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save as Excel (simulating real-world data source)</span></span>
<span id="cb31-16"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">require</span>(writexl)) {</span>
<span id="cb31-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">write_xlsx</span>(df, temp_file)</span>
<span id="cb31-18">} <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> {</span>
<span id="cb31-19">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fall back to CSV if writexl not available</span></span>
<span id="cb31-20">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">write.csv</span>(df, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.xlsx$"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".csv"</span>, temp_file), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">row.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb31-21">  temp_file <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.xlsx$"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".csv"</span>, temp_file)</span>
<span id="cb31-22">}</span>
<span id="cb31-23"></span>
<span id="cb31-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Read the file back</span></span>
<span id="cb31-25"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (temp_file <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.xlsx$"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".csv"</span>, temp_file)) {</span>
<span id="cb31-26">  imported_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.csv</span>(temp_file, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">check.names =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span>
<span id="cb31-27">} <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> {</span>
<span id="cb31-28">  imported_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_excel</span>(temp_file)</span>
<span id="cb31-29">}</span>
<span id="cb31-30"></span>
<span id="cb31-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View original column names</span></span>
<span id="cb31-32"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(imported_df))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "ID#"                "%_Completed"        "Result (Pass/Fail)"
[4] "μg/mL"              "p-value"           </code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb33-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create custom replacements</span></span>
<span id="cb33-2">custom_replacements <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb33-3">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"μg"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ug"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace Greek letter</span></span>
<span id="cb33-4">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"%"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"percent"</span>,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace percent symbol</span></span>
<span id="cb33-5">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"num"</span>   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace hash</span></span>
<span id="cb33-6">)</span>
<span id="cb33-7"></span>
<span id="cb33-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Clean with custom replacements</span></span>
<span id="cb33-9">clean_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> imported_df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb33-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">clean_names</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb33-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename_with</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> stringr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_replace_all</span>(., <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"p_value"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"probability"</span>))</span>
<span id="cb33-12"></span>
<span id="cb33-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View cleaned column names</span></span>
<span id="cb33-14"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">names</span>(clean_df))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "id_number"         "percent_completed" "result_pass_fail" 
[4] "mg_m_l"            "probability"      </code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb35-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Print the cleaned dataframe</span></span>
<span id="cb35-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(clean_df)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 5 × 5
  id_number percent_completed result_pass_fail mg_m_l probability
      &lt;dbl&gt;             &lt;dbl&gt; &lt;chr&gt;             &lt;dbl&gt;       &lt;dbl&gt;
1         1                85 Pass                0.5       0.03 
2         2                92 Pass                0.8       0.01 
3         3                78 Fail                0.3       0.08 
4         4               100 Pass                1.2       0.002
5         5                65 Fail                0.4       0.06 </code></pre>
</div>
</div>
<p>The final output shows the transformation from problematic column names to standardized ones:</p>
<p>From:</p>
<ul>
<li><p><code>ID#</code></p></li>
<li><p><code>%_Completed</code></p></li>
<li><p><code>Result (Pass/Fail)</code></p></li>
<li><p><code>μg/mL</code></p></li>
<li><p><code>p-value</code></p></li>
</ul>
<p>To:</p>
<ul>
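<li><p><code>id_number</code></p></li>
<li><p><code>percent_completed</code></p></li>
<li><p><code>result_pass_fail</code></p></li>
<li><p><code>mg_m_l</code></p></li>
<li><p><code>probability</code></p></li>
</ul>
<p>Note that <code>custom_replacements</code> is defined but never passed to <code>clean_names()</code>, so the default rules apply: <code>#</code> becomes <code>number</code> and <code>μ</code> is transliterated to <code>m</code>. Only the <code>p_value</code> to <code>probability</code> change comes from the extra <code>rename_with()</code> step. Even so, the example shows how <code>clean_names()</code> can be part of a more sophisticated data preparation workflow, especially when working with real-world data sources that contain problematic characters and naming conventions.</p>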
</section>
</section>
</section>
<section id="conclusion-why-these-functions-deserve-your-attention" class="level2">
<h2 class="anchored" data-anchor-id="conclusion-why-these-functions-deserve-your-attention">Conclusion: Why These Functions Deserve Your Attention</h2>
<p>R’s ecosystem is vast, but it’s easy to stick to the familiar and miss out on tools like Reduce, vapply, do.call and clean_names. These functions might not top the popularity charts, yet they pack a punch, whether it’s collapsing data without loops, ensuring type safety, building function calls on the fly, or fixing messy names. The examples here show just a taste of what they can do, from quick fixes to complex tasks. Curious to see how they fit into your workflow? Fire up R, play with them, and discover how these underdogs can become your new go-tos. What other hidden R treasures have you found? Drop them in the comments; I’d love to hear!</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li><p>R Core Team (2025). <em>R: A Language and Environment for Statistical Computing</em>. R Foundation for Statistical Computing, Vienna, Austria. Available at: <a href="https://www.R-project.org/" class="uri">https://www.R-project.org/</a></p></li>
<li><p>Firke, Sam (2023). <em>janitor: Simple Tools for Examining and Cleaning Dirty Data</em>. CRAN. Available at: <a href="https://CRAN.R-project.org/package=janitor" class="uri">https://CRAN.R-project.org/package=janitor</a></p></li>
<li><p>R Documentation for Reduce, vapply, do.call, clean_names.</p></li>
</ul>


</section>

 ]]></description>
  <category>reduce</category>
  <category>vapply</category>
  <category>do.call</category>
  <category>clean_names</category>
  <guid>https://mfatihtuzen.github.io/posts/2025-03-11_underrated_functions/</guid>
  <pubDate>Tue, 11 Mar 2025 00:00:00 GMT</pubDate>
  <media:content url="https://mfatihtuzen.github.io/posts/2025-03-11_underrated_functions/gems.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Unlocking CBRT Data in R: A Guide to the CBRT R Package</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-12-31_cbrt/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-12-31_cbrt/images/clipboard-2062160124.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>The Central Bank of the Republic of Turkey (CBRT) provides a wealth of economic data crucial for researchers, analysts, and policymakers. Through the Electronic Data Delivery System (EVDS), users can access time-series data on various economic indicators. The <code>CBRT</code> R package streamlines this process, letting users pull CBRT data directly into their R workflows. This post covers accessing CBRT data with the package, from obtaining an API key to practical examples of retrieving economic series.</p>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>The CBRT serves as Turkey’s central bank, tasked with implementing monetary policies and maintaining financial stability. The EVDS (Elektronik Veri Dağıtım Sistemi) is the CBRT’s online data delivery platform, providing access to a vast repository of economic data, including price indices, exchange rates, monetary aggregates, and more. EVDS supports API-based data retrieval, allowing programmatic access to its datasets.</p>
</section>
<section id="evds" class="level2">
<h2 class="anchored" data-anchor-id="evds">EVDS</h2>
<p><a href="https://evds2.tcmb.gov.tr/index.php">The Electronic Data Delivery System (EVDS)</a> is a dynamic and interactive system that presents statistical time series data produced by the CBRT and/or data produced by other institutions and compiled by the CBRT. These data are published on dynamic web pages. They can also be reported in the xls format or through the web service client (json, csv, xml), viewed in the graphics format, and received via e-mail by subscribing to the system. The EVDS was first introduced in 1995 and is available in Turkish and English.</p>
<p>The system provides a rich range of economic data and information to support economic education and foster economic research. Its technical infrastructure was revised in October 2017. The EVDS serves the public with its new facilities and content such as the REST web service, Customization, Reports, Interactive Charts, Frequently Used Data Groups, Recently Updated Data Groups, and data displayed on Turkey and world maps.</p>
</section>
<section id="setting-up-access-the-api-key" class="level2">
<h2 class="anchored" data-anchor-id="setting-up-access-the-api-key">Setting Up Access: The API Key</h2>
<p>To access EVDS data programmatically, you need an API key, which serves as a unique identifier for authenticating your requests.</p>
<ol type="1">
<li><p><strong>Requesting an API Key:</strong><br>
Visit <a href="https://evds2.tcmb.gov.tr/index.php">EVDS</a> and create an account. Once logged in, navigate to the API access section to generate your personal API key.</p></li>
<li><p><strong>Storing Your API Key Securely:</strong><br>
Avoid hardcoding your API key in scripts. Instead, save it in a <code>.txt</code> file and read it into your R session. For example:</p></li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">api_key <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">readLines</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path/to/your_api_key.txt"</span>)</span></code></pre></div></div>
</div>
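<p>A slightly more defensive version of this step might look like the sketch below. It is only an illustration: the temporary file and the demo key are made up so the example can run on its own, and in practice the path would point to your own key file.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical key file, created only for this illustration
key_file &lt;- tempfile(fileext = ".txt")
writeLines("  my-demo-key  ", key_file)

# Read the first line only and strip surrounding whitespace;
# warn = FALSE silences the warning about a missing final newline
api_key &lt;- trimws(readLines(key_file, n = 1, warn = FALSE))

# Fail early if the file turned out to be empty
stopifnot(nzchar(api_key))</code></pre></div>
</div>
<p>Trimming whitespace and checking for an empty value can help catch a malformed key file before any request is sent.</p>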
</section>
<section id="cbrt-package" class="level2">
<h2 class="anchored" data-anchor-id="cbrt-package">CBRT Package</h2>
<p>The <strong>CBRT R package</strong>, developed by <a href="https://avesis.metu.edu.tr/etaymaz">Prof.&nbsp;Dr.&nbsp;Erol Taymaz</a> from Middle East Technical University, is a powerful tool designed to simplify data retrieval from the Central Bank of the Republic of Turkey’s (CBRT) Electronic Data Delivery System (EVDS). This package enables users to efficiently access and analyze economic indicators by providing functions for querying data series, retrieving metadata, and searching for relevant datasets through the EVDS API. The package includes functions for finding and downloading data from the CBRT database, which covers more than 40,000 time series variables. For detailed documentation and further insights into the package, you can visit <a href="https://etaymaz.github.io/cbrt-2024.html">this link</a>.</p>
<p>The package is now available on&nbsp;<a href="https://cran.r-project.org/web/packages/CBRT">CRAN</a>&nbsp;(November 13, 2024) and can be installed with:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CBRT"</span>)</span></code></pre></div></div>
</div>
</section>
<section id="core-functions" class="level2">
<h2 class="anchored" data-anchor-id="core-functions">Core Functions</h2>
<p>All <strong>data series</strong> (variables) are classified into <strong>data groups</strong>, and data groups into <strong>data categories</strong>. There are 44 data categories (including the archived ones), 499 data groups, and 40,826 data series.</p>
<section id="getallcategoriesinfo" class="level3">
<h3 class="anchored" data-anchor-id="getallcategoriesinfo">getAllCategoriesInfo</h3>
<p>The <code>getAllCategoriesInfo</code> function in the <strong>CBRT R package</strong> returns the main data categories available in the Central Bank of the Republic of Türkiye’s (CBRT) Electronic Data Delivery System (EVDS). It requires a valid API key as an argument to authenticate the request. The resulting table gives a high-level view of how EVDS organizes its economic data. Note that the example below does not call the API: it loads <code>allCBRTCategories</code>, a copy of the categories table bundled with the package.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(CBRT)</span>
<span id="cb3-2">my_api_key <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"EVDS_API_KEY"</span>)</span>
<span id="cb3-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"allCBRTCategories"</span>)</span>
<span id="cb3-4">Categories <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> allCBRTCategories</span>
<span id="cb3-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(Categories)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>   cid                                           topic
1:   1                              MARKET DATA (CBRT)
2:   2                           EXCHANGE RATES (CBRT)
3:   3 INTEREST RATE AND PROFIT RATE STATISTICS (CBRT)
4:   4        MONTHLY MONEY AND BANK STATISTICS (CBRT)
5:   5                    SECURITIES STATISTICS (CBRT)
6:   6      GROSS EXTERNAL DEBT STOCK OF TÜRKİYE (GMB)</code></pre>
</div>
</div>
</section>
<section id="getallgroupsinfo" class="level3">
<h3 class="anchored" data-anchor-id="getallgroupsinfo">getAllGroupsInfo</h3>
<p>The <strong>CBRT R package</strong> offers the <code>getAllGroupsInfo</code> function, which returns detailed information about the data groups in EVDS. Like <code>getAllCategoriesInfo</code>, it requires a valid API key for authentication. Groups are finer classifications of data within the broader main categories. By using the <code>cid</code> (category ID) variable from the categories table, users can relate categories to their corresponding groups, navigate the EVDS hierarchy, and identify the datasets most relevant to their research or analysis.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">Groups <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getAllGroupsInfo</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">CBRTKey =</span> my_api_key)</span></code></pre></div></div>
<div class="cell-output cell-output-stderr">
<pre><code>Warning in fread(rawToChar(x$content), encoding = "UTF-8", na.strings = c("ND",
: Found and resolved improper quoting in first 100 rows. If the fields are not
quoted (e.g. field separator does not appear within any field), try quote="" to
avoid this warning.</code></pre>
</div>
<div class="cell-output cell-output-stderr">
<pre><code>Warning in fread(rawToChar(x$content), encoding = "UTF-8", na.strings = c("ND",
: Stopped early on line 339. Expected 21 fields but found 42. Consider
fill=TRUE and comment.char=. First discarded non-empty line:
&lt;&lt;https://data.bis.org/topics/TOTAL_CREDIT,https://data.bis.org/topics/TOTAL_CREDIT,Yüzde,Percentage,550201.0,bie_bistopkredi,Finans
Dışı Sektörün Toplam Kredi Kullanımı,Total Credit Utilization in the
Non-Financial Sector,Uluslararası Ödemeler Bankası (BIS),Bank for International
Settlements (BIS),01-04-2025,13.0,ÜÇ
AYLIK,25-03-2026,https://data.bis.org/topics/TOTAL_CREDIT,https://data.bis.org/topics/TOTAL_CREDIT,"Finans
dışı sektöre verilen kredilere ilişkin veri setinde&gt;&gt;</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(Groups)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>      cid      groupCode
1: 450108  bie_istirakbs
2:   5002 bie_akonutsat4
3:   5002 bie_akonutsat3
4:   5501 bie_imfgdpusdn
5:   3502 bie_tedavuladt
6: 400701   bie_dtitfb10
                                                                   groupName
1:                          Participations and Subsidiaries - Banking Sector
2:        House and Commercial Property Sales Statistics - Second hand sales
3:              House and Commercial Property Sales Statistics - First sales
4:                                                  IMF - GDP, Nominal (USD)
5:                         Banknotes in Circulation By Denomination (Number)
6: Foreign Trade Import Unit Value Index by Classification of BEC (2015=100)
   freq   source
1:    5    BANKS
2:    5 TURKSTAT
3:    5 TURKSTAT
4:    8      IMF
5:    5     CBRT
6:    5 TURKSTAT
                                                                                                                                                                             sourceLink
1: http://www.tcmb.gov.tr/wps/wcm/connect/f41b8ecb-2161-4db0-ac56-df35fb7554cf/MetadataAPB%C4%B02018.pdf?MOD=AJPERES&amp;CACHEID=ROOTWORKSPACE-f41b8ecb-2161-4db0-ac56-df35fb7554cf-ml2zpGJ
2:                                                                                                                              https://veriportali.tuik.gov.tr/en/press/58340/metadata
3:                                                                                                                              https://veriportali.tuik.gov.tr/en/press/58340/metadata
4:                                                                                                                                                                                     
5:                                                               http://www.tcmb.gov.tr/wps/wcm/connect/EN/TCMB+EN/Main+Menu/Banknotes/General+Information+on+Banknotes/Info+Materials/
6:                                                                                                                          https://data.tuik.gov.tr/Search/Search?text=Foreign%20Trade
                                                                                                                                                                                                                         revisionPolicy
1:                                                       http://www.tcmb.gov.tr/wps/wcm/connect/61cbc9ac-f600-4cc9-b167-4322b54d1dd5/Revision+Policy.pdf?MOD=AJPERES&amp;CACHEID=ROOTWORKSPACE-61cbc9ac-f600-4cc9-b167-4322b54d1dd5-m5hiF.Y
2: https://veriportali.tuik.gov.tr/api/en/data/downloads?t=r&amp;p=BWVWVuXn3OZ0HH575Xo6%2Bng%2F8o0JbjZrW7Qm4Fo6IChEUr89cOmVacFcOBPIYSIzc%2BngMbnWHFHcldrrqssexL3nVsLA%2ByB6NViPfIUkNugr%2BoB%2FsjsNRkeGF5BTVjbCFGF0TgEtEgjE46pnK7Sz5Q%3D%3D
3: https://veriportali.tuik.gov.tr/api/en/data/downloads?t=r&amp;p=BWVWVuXn3OZ0HH575Xo6%2Bng%2F8o0JbjZrW7Qm4Fo6IChEUr89cOmVacFcOBPIYSIzc%2BngMbnWHFHcldrrqssexL3nVsLA%2ByB6NViPfIUkNugr%2BoB%2FsjsNRkeGF5BTVjbCFGF0TgEtEgjE46pnK7Sz5Q%3D%3D
4:                                                                                                                                                                                                                                     
5:                                                                                                         http://www.tcmb.gov.tr/wps/wcm/connect/EN/TCMB+EN/Main+Menu/Banknotes/General+Information+on+Banknotes/Banknote+Reproduction
6:                                                                                                                                                                          https://data.tuik.gov.tr/Search/Search?text=Foreign%20Trade
                                                                                                                                                                                  appLink
1: http://www.tcmb.gov.tr/wps/wcm/connect/EN/TCMB+EN/Main+Menu/Statistics/Monetary+and+Financial+Statistics/Monthly+Money+and+Banking+Statistics/Announcements+on+Methodological+Changes/
2:                                                                                   https://dosya.tuik.gov.tr/FileLink/f8dzz-5d29cc13-b3ca-492b-84a1-de8ad44c78d8/02.0040.RP.2025.00_ENG
3:                                                                                   https://dosya.tuik.gov.tr/FileLink/f8dzz-5d29cc13-b3ca-492b-84a1-de8ad44c78d8/02.0040.RP.2025.00_ENG
4:                                                                                                                                                                                       
5:                                                     http://www.tcmb.gov.tr/wps/wcm/connect/EN/TCMB+EN/Main+Menu/Banknotes/General+Information+on+Banknotes/Banknote+Printing+Authority
6:                                                                                                                            https://data.tuik.gov.tr/Search/Search?text=Foreign%20Trade
    firstDate   lastDate
1: 01-02-2026 01-10-2007
2: 01-03-2026 01-01-2013
3: 01-03-2026 01-01-2013
4: 01-01-2026 01-01-2010
5: 01-03-2026 01-01-2009
6: 01-02-2026 01-01-2013</code></pre>
</div>
</div>
<p>Additionally, the groups table contains valuable metadata, including the date ranges for available data, data frequency, and data sources. The frequency of the data is indicated by predefined frequency codes:</p>
<ol type="1">
<li><p>Daily</p></li>
<li><p>Workday</p></li>
<li><p>Weekly</p></li>
<li><p>Biweekly</p></li>
<li><p>Monthly</p></li>
<li><p>Quarterly</p></li>
<li><p>Semiannual</p></li>
<li><p>Annual</p></li>
</ol>
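<p>Since the <code>freq</code> column stores only the numeric code, it can help to attach a readable label. The sketch below uses the code list above as a lookup vector. <code>groups_demo</code> is a small stand-in built from three rows of the printed output, and <code>freqLabel</code> is a column name chosen here for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Frequency labels in code order: position 1 is code 1, position 8 is code 8.
freq_labels &lt;- c("Daily", "Workday", "Weekly", "Biweekly",
                 "Monthly", "Quarterly", "Semiannual", "Annual")

# Stand-in for the groups table (three rows taken from the output above).
groups_demo &lt;- data.frame(
  groupCode = c("bie_istirakbs", "bie_imfgdpusdn", "bie_tedavuladt"),
  freq      = c(5, 8, 5)
)

# Index the label vector by the numeric code.
groups_demo$freqLabel &lt;- freq_labels[groups_demo$freq]
groups_demo</code></pre></div>
</div>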
</section>
<section id="getallseriesinfo" class="level3">
<h3 class="anchored" data-anchor-id="getallseriesinfo">getAllSeriesInfo</h3>
<p>The <code>getAllSeriesInfo</code> function in the <strong>CBRT R package</strong> retrieves up-to-date metadata for the data series available in EVDS. Like the other functions in the package, it requires a valid API key for authentication. The metadata includes group codes, series names, frequencies, and start and end dates, which help users identify and filter the series of interest. Through shared columns such as <code>cid</code> and <code>groupCode</code>, the series metadata can be linked to the categories and groups tables, so the data hierarchy can be explored as a connected whole.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1">Series <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getAllSeriesInfo</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">CBRTKey =</span> my_api_key)</span></code></pre></div></div>
</div>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(Series)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>   cid                                  topic    groupCode
1:  13 CENTRAL BANK BALANCE SHEET DATA (CBRT) bie_abanlbil
2:  13 CENTRAL BANK BALANCE SHEET DATA (CBRT) bie_abanlbil
3:  13 CENTRAL BANK BALANCE SHEET DATA (CBRT) bie_abanlbil
4:  13 CENTRAL BANK BALANCE SHEET DATA (CBRT) bie_abanlbil
5:  13 CENTRAL BANK BALANCE SHEET DATA (CBRT) bie_abanlbil
6:  13 CENTRAL BANK BALANCE SHEET DATA (CBRT) bie_abanlbil
                                              groupName freq seriesCode
1: Central Bank Analytical Balance Sheet (Thousand TRY)    2  TP.AB.A01
2: Central Bank Analytical Balance Sheet (Thousand TRY)    2  TP.AB.A02
3: Central Bank Analytical Balance Sheet (Thousand TRY)    2  TP.AB.A03
4: Central Bank Analytical Balance Sheet (Thousand TRY)    2  TP.AB.A04
5: Central Bank Analytical Balance Sheet (Thousand TRY)    2  TP.AB.A05
6: Central Bank Analytical Balance Sheet (Thousand TRY)    2 TP.AB.A051
             seriesName      start        end aggMethod freqname
1:             A.ASSETS 26-12-1980 20-04-2026      last Work day
2:   A.1 FOREIGN ASSETS 26-12-1980 20-04-2026      last Work day
3:  A.2 DOMESTIC ASSETS 26-12-1980 20-04-2026      last Work day
4: A.2A Cash Operations 26-12-1980 31-12-2012      last Work day
5:  A.2Aa Treasury Debt 26-12-1980 20-04-2026      last Work day
6:    A.2Aa1 Securities 24-11-2000 20-04-2026      last Work day
                                                                      tag
1: Balance, Sheet, Analytical, Balance, Sheet, Data, Financial, Statement
2: Balance, Sheet, Analytical, Balance, Sheet, Data, Financial, Statement
3: Balance, Sheet, Analytical, Balance, Sheet, Data, Financial, Statement
4: Balance, Sheet, Analytical, Balance, Sheet, Data, Financial, Statement
5: Balance, Sheet, Analytical, Balance, Sheet, Data, Financial, Statement
6: Balance, Sheet, Analytical, Balance, Sheet, Data, Financial, Statement</code></pre>
</div>
</div>
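<p>Linking the tables is an ordinary join. The sketch below assumes that <code>groupCode</code> is the shared key between the series and groups tables, as the printed columns suggest. The two small data frames are made-up placeholders, not real EVDS records.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Placeholder tables with invented codes, standing in for Series and Groups.
series_demo &lt;- data.frame(
  seriesCode = c("S1", "S2", "S3"),
  groupCode  = c("grp_a", "grp_a", "grp_b")
)
groups_demo &lt;- data.frame(
  groupCode = c("grp_a", "grp_b"),
  source    = c("CBRT", "TURKSTAT")
)

# Left join: keep every series and attach the group-level metadata.
series_with_source &lt;- merge(series_demo, groups_demo,
                            by = "groupCode", all.x = TRUE)
series_with_source</code></pre></div>
</div>
<p>With the real tables, the same call would take <code>Series</code> and <code>Groups</code> in place of the placeholders.</p>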
</section>
<section id="searchcbrt" class="level3">
<h3 class="anchored" data-anchor-id="searchcbrt">searchCBRT</h3>
<p>The <code>searchCBRT</code> function in the <strong>CBRT R package</strong> searches category, group, or series names in EVDS. Users supply keywords and the field to search in, and the function returns the matching entries. This makes it much easier to locate a specific table or series in the extensive EVDS repository, whether the target is a broad topic, a particular group, or an individual data series.</p>
<p>Suppose we want to find datasets related to “Consumer Prices” within the EVDS system. Using the <code>searchCBRT</code> function, we can search for this keyword in relevant fields to locate the desired tables or series. Here’s how to do it:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">searchCBRT</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"consumer price"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">field =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"series"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>          seriesCode
 1:  TP.ENFBEK.TEA12
 2: TP.ENFBEK.TEA345
 3:     TP.FE.OKTG01
 4:        TP.FG.A09
 5:        TP.FG.A10
 6:       TP.TG2.Y14
 7:       TP.TG2.Y15
 8:   TP.FE25.OKTG01
 9:        TP.FG.F19
10:        TP.FG.F20
                                                                                                          seriesName
 1:              Percentage of households expecting consumer prices to increase more rapidly or at the same rate (%)
 2: Percentage of households expecting consumer prices to stay about the same, fall or increase at a slower rate (%)
 3:                                                                                             Consumer Price Index
 4:                                                                        Consumer Prices Index of Ankara (Archive)
 5:                                                                      Consumer Prices Index of Istanbul (Archive)
 6:                                              Assessment on Consumer prices change rate (over the last 12 months)
 7:             Expectation for consumer prices change rate (over the next 12 months compared to the past 12 months)
 8:                                                                                             Consumer Price Index
 9:                                                                            Ankara Consumer Price Index (Archive)
10:                                                                          Istanbul Consumer Price Index (Archive)
        groupCode
 1:    bie_enfbek
 2:    bie_enfbek
 3:    bie_feoktg
 4: bie_fgtukfiy2
 5: bie_fgtukfiy2
 6:   bie_mbgven2
 7:   bie_mbgven2
 8: bie_oktug2025
 9:   bie_tukfiy1
10:   bie_tukfiy1
                                                                                                groupName
 1:                                                                       Sectoral Inflation Expectations
 2:                                                                       Sectoral Inflation Expectations
 3:                                          Indicators For The CPIs Having Specified Coverage (2003=100)
 4:                                                  Consumer Price Index (1987=100) (TURKSTAT) (Archive)
 5:                                                  Consumer Price Index (1987=100) (TURKSTAT) (Archive)
 6: Seasonally unadjusted Consumer Confidence Index and Indices of Consumer Tendency Survey Questions (*)
 7: Seasonally unadjusted Consumer Confidence Index and Indices of Consumer Tendency Survey Questions (*)
 8:                                          Indicators For The CPIs Having Specified Coverage (2025=100)
 9:                                             Consumer Price Index (1978-1979=100) (TURKSTAT) (Archive)
10:                                             Consumer Price Index (1978-1979=100) (TURKSTAT) (Archive)</code></pre>
</div>
</div>
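<p>Conceptually, a keyword search like this is a case-insensitive match against a name column. The sketch below illustrates the idea on a tiny table built from names that appear in the outputs above. <code>find_series</code> is a hypothetical helper and is not how <code>searchCBRT</code> is implemented internally.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Small stand-in for the series metadata table.
series_demo &lt;- data.frame(
  seriesCode = c("TP.FE.OKTG01", "TP.FG.A09", "TP.AB.A01"),
  seriesName = c("Consumer Price Index",
                 "Consumer Prices Index of Ankara (Archive)",
                 "A.ASSETS")
)

# Hypothetical helper: literal, case-insensitive substring match on seriesName.
find_series &lt;- function(tbl, keyword) {
  hits &lt;- grepl(tolower(keyword), tolower(tbl$seriesName), fixed = TRUE)
  tbl[hits, , drop = FALSE]
}

find_series(series_demo, "consumer price")</code></pre></div>
</div>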
</section>
<section id="getdataseries" class="level3">
<h3 class="anchored" data-anchor-id="getdataseries">getDataSeries</h3>
<p>The <code>getDataSeries</code> function in the <strong>CBRT R package</strong> is a versatile tool for importing one or more time series directly from the EVDS. This function provides users with several advanced features to customize their data retrieval. For example, users can specify the frequency level (<code>freq</code>), such as daily, weekly, or monthly, and set a date range using the <code>startDate</code> and <code>endDate</code> arguments in the format <code>DD-MM-YYYY</code>. If the <code>endDate</code> is not specified, the function automatically retrieves data up to the latest available point.</p>
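<p>Because the date arguments are passed as <code>DD-MM-YYYY</code> strings, dates held as R <code>Date</code> objects need to be formatted first. A minimal sketch follows; <code>to_evds_date</code> is a hypothetical helper, not part of the package.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical helper: convert a Date (or ISO "YYYY-MM-DD" string)
# to the DD-MM-YYYY string format.
to_evds_date &lt;- function(x) format(as.Date(x), "%d-%m-%Y")

to_evds_date("2010-01-01")            # "01-01-2010"
to_evds_date(as.Date("2023-12-31"))   # "31-12-2023"</code></pre></div>
</div>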
<p>An additional feature of <code>getDataSeries</code> is its ability to aggregate higher-frequency data into lower-frequency formats using the <code>aggType</code> argument. Supported aggregation methods include:</p>
<ul>
<li><p><code>avg</code>: Average value,</p></li>
<li><p><code>first</code>: First observation,</p></li>
<li><p><code>last</code>: Last observation,</p></li>
<li><p><code>max</code>: Maximum value,</p></li>
<li><p><code>min</code>: Minimum value,</p></li>
<li><p><code>sum</code>: Summation of values.</p></li>
</ul>
<p>For instance, if weekly data is aggregated to a monthly frequency, the aggregation method is applied to compute the resulting values. Furthermore, the <code>na.rm</code> argument allows users to drop all missing dates, ensuring clean and continuous time series data.</p>
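<p>To see what the choice of aggregation method means in practice, the sketch below aggregates a made-up weekly series to monthly frequency in base R, once with the average and once with the last observation. The numbers are invented for illustration, and the code only mimics the idea; it does not reproduce how EVDS computes the aggregates.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Invented weekly observations covering two months.
weekly &lt;- data.frame(
  date  = as.Date(c("2024-01-05", "2024-01-12", "2024-01-19", "2024-01-26",
                    "2024-02-02", "2024-02-09")),
  value = c(10, 12, 14, 16, 20, 22)
)
weekly$month &lt;- format(weekly$date, "%Y-%m")

# Counterpart of "avg": monthly mean of the weekly values (13 and 21).
monthly_avg &lt;- aggregate(value ~ month, data = weekly, FUN = mean)

# Counterpart of "last": last weekly value in each month (16 and 22).
monthly_last &lt;- aggregate(value ~ month, data = weekly,
                          FUN = function(v) v[length(v)])

monthly_avg
monthly_last</code></pre></div>
</div>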
<p>Here’s an example demonstrating its use:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Import a time series (e.g., CPI data) with specific parameters</span></span>
<span id="cb15-2">cpi_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getDataSeries</span>(</span>
<span id="cb15-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">series =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TP.FE.OKTG01"</span>),       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example series ID</span></span>
<span id="cb15-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">CBRTKey =</span> my_api_key,            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Your API key</span></span>
<span id="cb15-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">freq =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,                     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Monthly frequency</span></span>
<span id="cb15-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startDate =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"01-01-2010"</span>,     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Start date</span></span>
<span id="cb15-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">endDate =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"31-12-2023"</span>,       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># End date</span></span>
<span id="cb15-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>                  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Remove missing dates</span></span>
<span id="cb15-9">)</span>
<span id="cb15-10"></span>
<span id="cb15-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View the imported data</span></span>
<span id="cb15-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(cpi_data)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>         time TP.FE.OKTG01
1: 2010-01-15       174.07
2: 2010-02-15       176.59
3: 2010-03-15       177.62
4: 2010-04-15       178.68
5: 2010-05-15       178.04
6: 2010-06-15       177.04</code></pre>
</div>
</div>
<p>As a second example, suppose we want to fetch the USD, EUR, and GBP exchange rates against the Turkish lira (TRY) at monthly frequency for a specific period.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define the series IDs for USD, EUR, and GBP (Sales rate against TRY)</span></span>
<span id="cb17-2">usd_series <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TP.DK.USD.S"</span></span>
<span id="cb17-3">eur_series <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TP.DK.EUR.S"</span></span>
<span id="cb17-4">gbp_series <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TP.DK.GBP.S"</span></span>
<span id="cb17-5"></span>
<span id="cb17-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define the frequency code</span></span>
<span id="cb17-7">freq <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Monthly frequency</span></span>
<span id="cb17-8"></span>
<span id="cb17-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define the date range for the data (e.g., from 01-01-2020 to 31-12-2024)</span></span>
<span id="cb17-10">startDate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"01-01-2020"</span></span>
<span id="cb17-11">endDate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"31-12-2024"</span></span>
<span id="cb17-12"></span>
<span id="cb17-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fetch the data for USD, EUR, and GBP exchange rates</span></span>
<span id="cb17-14">exchange_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">getDataSeries</span>(</span>
<span id="cb17-15">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">series =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(usd_series,eur_series,gbp_series),</span>
<span id="cb17-16">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">CBRTKey =</span> my_api_key,</span>
<span id="cb17-17">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">freq =</span> freq,</span>
<span id="cb17-18">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startDate =</span> startDate,</span>
<span id="cb17-19">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">endDate =</span> endDate,</span>
<span id="cb17-20">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb17-21">)</span>
<span id="cb17-22"></span>
<span id="cb17-23"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(exchange_data)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>         time TP.DK.USD.S TP.DK.EUR.S TP.DK.GBP.S
1: 2020-01-15    5.928827    6.586905    7.763218
2: 2020-02-15    6.055370    6.605785    7.872095
3: 2020-03-15    6.325805    7.001341    7.858764
4: 2020-04-15    6.831252    7.430133    8.493257
5: 2020-05-15    6.964488    7.573124    8.588112
6: 2020-06-15    6.821091    7.676245    8.560195</code></pre>
</div>
</div>
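<p>The result has one column per series. For plotting or grouped summaries, a long layout with one row per date and series is often more convenient. The sketch below rebuilds the first three rows of the output above by hand and reshapes them with base R. The names <code>exchange_long</code>, <code>series</code>, and <code>value</code> are choices made here for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># First three rows of the printed output, entered by hand.
exchange_demo &lt;- data.frame(
  time        = as.Date(c("2020-01-15", "2020-02-15", "2020-03-15")),
  TP.DK.USD.S = c(5.928827, 6.055370, 6.325805),
  TP.DK.EUR.S = c(6.586905, 6.605785, 7.001341),
  TP.DK.GBP.S = c(7.763218, 7.872095, 7.858764)
)

# Stack the rate columns: one row per (time, series) pair.
rate_cols &lt;- setdiff(names(exchange_demo), "time")
exchange_long &lt;- data.frame(
  time   = rep(exchange_demo$time, times = length(rate_cols)),
  series = rep(rate_cols, each = nrow(exchange_demo)),
  value  = unlist(exchange_demo[rate_cols], use.names = FALSE)
)
head(exchange_long)</code></pre></div>
</div>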
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The <code>CBRT</code> R package is a powerful tool for accessing and analyzing Turkish economic data. By combining the package’s functionality with R’s robust analytical tools, users can unlock insights and streamline their research. Whether you’re tracking inflation trends, analyzing monetary policy impacts, or studying exchange rates, the <code>CBRT</code> package offers a seamless experience.</p>
<section id="references" class="level3">
<h3 class="anchored" data-anchor-id="references">References</h3>
<ol type="1">
<li><p>Taymaz, E. (2024). <em>CBRT R Package</em>. Retrieved from <a href="https://users.metu.edu.tr/etaymaz/cbrt-2024.html#the-package">CBRT Package</a> and <a href="https://etaymaz.github.io/cbrt-2024.html">Documentation</a></p></li>
<li><p>Central Bank of the Republic of Turkey. <em>Electronic Data Delivery System (EVDS)</em>. Retrieved from <a href="https://evds2.tcmb.gov.tr/index.php?">EVDS</a></p></li>
</ol>


</section>
</section>

 ]]></description>
  <category>R Programming</category>
  <category>CBRT</category>
  <category>EVDS</category>
  <category>Import</category>
  <category>API</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-12-31_cbrt/</guid>
  <pubDate>Tue, 31 Dec 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Extracting Data from OECD Databases in R: Using the oecd and rsdmx Packages</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-12-16_oecd/</link>
  <description><![CDATA[ 






<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>The <strong>OECD (Organisation for Economic Co-operation and Development)</strong> provides extensive databases of economic, social, and environmental indicators. Accessing them programmatically through R is efficient and reproducible. In this article, we explore two popular R packages for accessing OECD data, <strong><code>OECD</code></strong> and <strong><code>rsdmx</code></strong>, and discuss critical updates to the OECD Developer API that have affected package functionality.</p>
<p>We also provide practical examples, emphasize the importance of applying filters during data retrieval, and guide users on how to work with the latest tools to ensure seamless data access.</p>
</section>
<section id="why-programmatic-access-matters" class="level2">
<h2 class="anchored" data-anchor-id="why-programmatic-access-matters">Why Programmatic Access Matters</h2>
<p>Accessing data programmatically offers several benefits:</p>
<ol type="1">
<li><p><strong>Customization</strong>: Tailor requests to retrieve only the data you need (e.g., specific countries, indicators, and years).</p></li>
<li><p><strong>Efficiency</strong>: Save time and bandwidth by filtering data before download.</p></li>
<li><p><strong>Reproducibility</strong>: Ensure that analyses can be easily updated or shared.</p></li>
<li><p><strong>Automation</strong>: Streamline workflows by automating data extraction.</p></li>
</ol>
</section>
<section id="oecd-data-explorer-exploring-and-accessing-data" class="level2">
<h2 class="anchored" data-anchor-id="oecd-data-explorer-exploring-and-accessing-data">OECD Data Explorer: Exploring and Accessing Data</h2>
<p>The OECD provides programmatic access to its data, covering member countries and selected non-member economies, through a RESTful application programming interface (API) based on the SDMX standard. Developers can query this API in several ways to build applications that use dynamically updated OECD data.</p>
<p>The <strong>OECD Data Explorer</strong> is an interactive web-based platform that allows users to explore, visualize, and download data from the OECD databases. It is particularly useful for users who want to manually browse through datasets before deciding on specific data points for analysis. Here, we provide an overview of the <strong>OECD Data Explorer</strong>, including how to navigate the platform, customize filters, and access API links for programmatic use.</p>
<p>The OECD Data Explorer is available at: <a href="https://data-explorer.oecd.org/" class="uri">https://data-explorer.oecd.org/</a></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-12-16_oecd/images/oecd_data_explorer.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>When you visit the site, you are greeted with a clean interface for navigating through datasets. The platform organizes data into <strong>themes</strong> such as:</p>
<ul>
<li><p>Economy</p></li>
<li><p>Education</p></li>
<li><p>Environment</p></li>
<li><p>Health</p></li>
<li><p>Innovation and Technology</p></li>
<li><p>Employment</p></li>
</ul>
<p>Each theme contains various datasets that can be explored interactively.</p>
<section id="using-the-oecd-data-explorer" class="level3">
<h3 class="anchored" data-anchor-id="using-the-oecd-data-explorer">Using the OECD Data Explorer</h3>
<section id="search-for-a-dataset" class="level4">
<h4 class="anchored" data-anchor-id="search-for-a-dataset">1. <strong>Search for a Dataset</strong></h4>
<p>The search bar allows you to quickly locate datasets. For example, if you are interested in unemployment data, simply type “unemployment” in the search bar.</p>
</section>
<section id="customize-filters" class="level4">
<h4 class="anchored" data-anchor-id="customize-filters">2. <strong>Customize Filters</strong></h4>
<p>Once you’ve selected a dataset (e.g., <em>Labour Market Statistics</em>), you can apply various filters to narrow down the data you need. Some of them are given below:</p>
<ul>
<li><p><strong>Geographical Region</strong>: Choose specific countries or regions (e.g., USA, France, OECD Total).</p></li>
<li><p><strong>Time Period</strong>: Select years of interest (e.g., 2015–2023).</p></li>
<li><p><strong>Indicator</strong>: Specify what you are analyzing (e.g., Unemployment Rate, Employment-to-Population Ratio).</p></li>
<li><p><strong>Measurement Units</strong>: Choose relevant units (e.g., percentages, index values).</p></li>
</ul>
</section>
<section id="explore-data-visualizations" class="level4">
<h4 class="anchored" data-anchor-id="explore-data-visualizations">3. <strong>Explore Data Visualizations</strong></h4>
<p>The platform provides instant visualizations, such as tables, line charts, and bar charts, based on your selected filters. These visualizations make it easy to understand trends and patterns in the data.</p>
</section>
<section id="exporting-data" class="level4">
<h4 class="anchored" data-anchor-id="exporting-data">4. Exporting Data</h4>
<p>Once you’ve customized the dataset, you can download it manually in one of the available formats, such as <strong>Excel</strong> or <strong>CSV</strong>. The other option is the API link: for programmatic access, the <strong>OECD Data Explorer</strong> provides API links that can be used in R or other programming languages. After selecting your filters, click on <strong>Developer API</strong> and copy the generated link.</p>
<p>For example, suppose we want to pull the unemployment rates of several countries. After applying the desired filters, the Data Explorer generates a link like this:</p>
<p><code>https://sdmx.oecd.org/public/rest/data/OECD.SDD.TPS,DSD_LFS@DF_IALFS_UNE_M,1.0/BEL+AUS+AUT+CAN+DNK+FRA+DEU+GRC+HUN+IRL+ITA+JPN+NLD+NZL+NOR+PRT+SVN+ESP+SWE+CHE+USA+GBR+TUR..PT_LF_SUB._Z.Y._T.Y_GE15..M?startPeriod=2023-11&amp;dimensionAtObservation=AllDimensions</code></p>
<p>This link can be directly used with R packages like <code>rsdmx</code> to fetch data programmatically.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-12-16_oecd/images/data_page.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>For more detail, see <a href="https://www.oecd.org/en/data/insights/data-explainers/2024/09/api.html" class="uri">https://www.oecd.org/en/data/insights/data-explainers/2024/09/api.html</a>, which explains how to retrieve data from the OECD Data Explorer programmatically via the API.</p>
</section>
</section>
</section>
<section id="the-oecd-package-accessing-oecd-data-in-r" class="level2">
<h2 class="anchored" data-anchor-id="the-oecd-package-accessing-oecd-data-in-r">The <code>OECD</code> Package: Accessing OECD Data in R</h2>
<p>The <strong><code>OECD</code></strong> package is an R package designed to provide a convenient interface for accessing data from the <strong>OECD Developer API</strong>. It allows users to:</p>
<ul>
<li><p>Explore available datasets in the OECD databases.</p></li>
<li><p>Retrieve filtered data programmatically for specific countries, indicators, and time periods.</p></li>
<li><p>Work with data in a reproducible way directly within R.</p></li>
</ul>
<p>However, the version of the <strong><code>OECD</code></strong> package available on <strong>CRAN</strong> is currently <strong>outdated</strong> due to recent changes in the OECD API (2024). These changes have impacted the functionality of some key features in the CRAN release. You can find more information about changes in the OECD API from <a href="https://www.oecd.org/en/data/insights/data-explainers/2024/09/OECD-DE-FAQ.html" class="uri">https://www.oecd.org/en/data/insights/data-explainers/2024/09/OECD-DE-FAQ.html</a>.</p>
<p>To overcome these limitations, use the <strong>updated version</strong> of the <code>OECD</code> package available on GitHub, which is compatible with the latest OECD API.</p>
<p>For installation and usage details, refer to the updated package repository:<br>
<a href="https://github.com/expersso/OECD"><strong>https://github.com/expersso/OECD</strong></a></p>
<p><strong>Installing the Updated <code>OECD</code> Package:</strong></p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install devtools if not already installed</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"devtools"</span>)</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Install the updated oecd package from GitHub</span></span>
<span id="cb1-5">devtools<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install_github</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"expersso/OECD"</span>)</span></code></pre></div></div>
</div>
<p>The updated version of the <strong><code>OECD</code></strong> package simplifies interaction with the OECD API by focusing on two core functions: <strong><code>get_data_structure()</code></strong> and <strong><code>get_dataset()</code></strong>. Here’s a brief overview of their functionality and arguments:</p>
<section id="get_data_structure" class="level3">
<h3 class="anchored" data-anchor-id="get_data_structure">1. <strong><code>get_data_structure()</code></strong></h3>
<p>This function retrieves metadata about a specific dataset from the OECD API, including its variables, classifications, adjustments, and unit measures. For example, to get this information for the unemployment-rate data, we need the dataset identifier from the Developer API link given above: it is the part of the link between the slash (/) that follows <code>data</code> and the next slash (shown in blue in the screenshot).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-12-16_oecd/images/data_query.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(OECD)</span>
<span id="cb2-2">dataset_unemprate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"OECD.SDD.TPS,DSD_LFS@DF_IALFS_UNE_M,1.0"</span></span>
<span id="cb2-3">data_str <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_data_structure</span>(dataset_unemprate)</span>
<span id="cb2-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str</span>(data_str, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">max.level =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>List of 15
 $ VAR_DESC               :'data.frame':    17 obs. of  2 variables:
 $ CL_ACTIVITY_ISIC4      :'data.frame':    958 obs. of  2 variables:
 $ CL_ADJUSTMENT          :'data.frame':    17 obs. of  2 variables:
 $ CL_AGE                 :'data.frame':    308 obs. of  2 variables:
 $ CL_AREA                :'data.frame':    469 obs. of  2 variables:
 $ CL_SECTOR              :'data.frame':    216 obs. of  2 variables:
 $ CL_SEX                 :'data.frame':    7 obs. of  2 variables:
 $ CL_TRANSFORMATION      :'data.frame':    59 obs. of  2 variables:
 $ CL_UNIT_MEASURE        :'data.frame':    670 obs. of  2 variables:
 $ CL_WORKER_STATUS_ICSE93:'data.frame':    13 obs. of  2 variables:
 $ CL_MEASURE_LFS_TPS     :'data.frame':    30 obs. of  2 variables:
 $ CL_DECIMALS            :'data.frame':    16 obs. of  2 variables:
 $ CL_FREQ                :'data.frame':    34 obs. of  2 variables:
 $ CL_OBS_STATUS          :'data.frame':    20 obs. of  2 variables:
 $ CL_UNIT_MULT           :'data.frame':    31 obs. of  4 variables:</code></pre>
</div>
</div>
</section>
<section id="get_dataset" class="level3">
<h3 class="anchored" data-anchor-id="get_dataset">2. <strong><code>get_dataset()</code></strong></h3>
<p>This function retrieves the actual data from a specified dataset, with optional filters for dimensions like country, time, and indicators.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_dataset</span>(</span>
<span id="cb4-2">  dataset,</span>
<span id="cb4-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">filter =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb4-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">start_time =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb4-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">end_time =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb4-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">last_n_observations =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span>,</span>
<span id="cb4-7">  ...</span>
<span id="cb4-8">)</span></code></pre></div></div>
</div>
<p>For the <code>filter</code> argument, take the part of the link that starts right after the “/” following the dataset identifier and ends just before the question mark “?”; the question mark itself is not included. For time filtering, use the <code>start_time</code> and <code>end_time</code> arguments.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-12-16_oecd/images/data_query_filters.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">data_filters_unemprate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"BEL+AUS+AUT+CAN+DNK+FRA+DEU+GRC+HUN+IRL+ITA+JPN+NLD+NZL+NOR+PRT+SVN+ESP+SWE+CHE+USA+GBR+TUR..PT_LF_SUB._Z.Y._T.Y_GE15..M"</span></span>
<span id="cb5-2"></span>
<span id="cb5-3">df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_dataset</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dataset =</span> dataset_unemprate,</span>
<span id="cb5-4">                  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">filter =</span> data_filters_unemprate,</span>
<span id="cb5-5">                  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">start_time =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2014</span>)</span>
<span id="cb5-6"></span>
<span id="cb5-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(df)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>  ACTIVITY ADJUSTMENT    AGE DECIMALS FREQ  MEASURE OBS_STATUS ObsValue
1       _Z          Y Y_GE15        1    M UNE_LF_M          A      3.7
2       _Z          Y Y_GE15        1    M UNE_LF_M          A      3.6
3       _Z          Y Y_GE15        1    M UNE_LF_M          A      3.7
4       _Z          Y Y_GE15        1    M UNE_LF_M          A      3.6
5       _Z          Y Y_GE15        1    M UNE_LF_M          A      3.6
6       _Z          Y Y_GE15        1    M UNE_LF_M          A      3.7
  REF_AREA SEX TIME_PERIOD TRANSFORMATION UNIT_MEASURE UNIT_MULT
1      JPN  _T     2014-01             _Z    PT_LF_SUB         0
2      JPN  _T     2014-02             _Z    PT_LF_SUB         0
3      JPN  _T     2014-03             _Z    PT_LF_SUB         0
4      JPN  _T     2014-04             _Z    PT_LF_SUB         0
5      JPN  _T     2014-05             _Z    PT_LF_SUB         0
6      JPN  _T     2014-06             _Z    PT_LF_SUB         0</code></pre>
</div>
</div>
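<p>The two extraction rules above (dataset identifier, then filter expression) can be sketched as a small base R helper. Note that <code>split_oecd_url()</code> is a hypothetical name used for illustration, not a function of the <code>OECD</code> package, and the URL is a shortened variant of the link shown earlier.</p>

```r
# Hypothetical helper (not part of the OECD package): split a Developer API
# link into the dataset identifier and the filter expression.
split_oecd_url <- function(url) {
  # Drop the query string, i.e. everything from the first "?" onwards
  path <- sub("\\?.*$", "", url)
  # Keep only what follows "/data/"
  after_data <- sub("^.*/data/", "", path)
  list(
    dataset = sub("/.*$", "", after_data),   # text before the next "/"
    filter  = sub("^[^/]*/", "", after_data) # text after that "/"
  )
}

# Shortened example link (two countries only, assumed for illustration)
example_url <- paste0(
  "https://sdmx.oecd.org/public/rest/data/",
  "OECD.SDD.TPS,DSD_LFS@DF_IALFS_UNE_M,1.0/",
  "USA+TUR..PT_LF_SUB._Z.Y._T.Y_GE15..M",
  "?startPeriod=2023-11"
)

parts <- split_oecd_url(example_url)
print(parts$dataset)
print(parts$filter)
```

<p>The resulting <code>parts$dataset</code> and <code>parts$filter</code> values could then be passed to the <code>dataset</code> and <code>filter</code> arguments of <code>get_dataset()</code>.</p>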
</section>
</section>
<section id="using-the-rsdmx-package" class="level2">
<h2 class="anchored" data-anchor-id="using-the-rsdmx-package">Using the <code>rsdmx</code> Package</h2>
<p>The <code>rsdmx</code> package lets you interact with the OECD Developer API through the <strong>SDMX format</strong>. It is particularly useful if you prefer working directly with API URLs.</p>
<section id="installing-the-rsdmx-package" class="level3">
<h3 class="anchored" data-anchor-id="installing-the-rsdmx-package">Installing the <code>rsdmx</code> Package</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rsdmx"</span>)</span></code></pre></div></div>
</div>
<section id="key-functions-in-rsdmx" class="level4">
<h4 class="anchored" data-anchor-id="key-functions-in-rsdmx">Key Functions in <code>rsdmx</code></h4>
<ol type="1">
<li><p><strong><code>readSDMX()</code></strong>: Fetches data from an SDMX-compatible API endpoint.</p></li>
<li><p><strong><code>as.data.frame()</code></strong>: Converts the retrieved SDMX object into a data frame.</p></li>
</ol>
</section>
<section id="example-workflow-with-rsdmx" class="level4">
<h4 class="anchored" data-anchor-id="example-workflow-with-rsdmx">Example Workflow with <code>rsdmx</code></h4>
<p>Here’s how you can retrieve unemployment data:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the rsdmx package</span></span>
<span id="cb8-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(rsdmx)</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define the API URL for unemployment rates</span></span>
<span id="cb8-5">oecd_url <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://sdmx.oecd.org/public/rest/data/OECD.SDD.TPS,DSD_LFS@DF_IALFS_UNE_M,1.0/BEL+AUS+AUT+CAN+DNK+FRA+DEU+GRC+HUN+IRL+ITA+JPN+NLD+NZL+NOR+PRT+SVN+ESP+SWE+CHE+USA+GBR+TUR..PT_LF_SUB._Z.Y._T.Y_GE15..M?startPeriod=2023-11&amp;dimensionAtObservation=AllDimensions"</span></span>
<span id="cb8-6"></span>
<span id="cb8-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 1: Fetch the data</span></span>
<span id="cb8-8">unemployment_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">readSDMX</span>(oecd_url)</span>
<span id="cb8-9"></span>
<span id="cb8-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Step 2: Convert to a data frame</span></span>
<span id="cb8-11">unemployment_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.data.frame</span>(unemployment_data)</span>
<span id="cb8-12"></span>
<span id="cb8-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View the data</span></span>
<span id="cb8-14"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(unemployment_df)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>  TIME_PERIOD REF_AREA  MEASURE UNIT_MEASURE TRANSFORMATION ADJUSTMENT SEX
1     2023-11      JPN UNE_LF_M    PT_LF_SUB             _Z          Y  _T
2     2023-12      JPN UNE_LF_M    PT_LF_SUB             _Z          Y  _T
3     2024-01      JPN UNE_LF_M    PT_LF_SUB             _Z          Y  _T
4     2024-02      JPN UNE_LF_M    PT_LF_SUB             _Z          Y  _T
5     2024-03      JPN UNE_LF_M    PT_LF_SUB             _Z          Y  _T
6     2024-04      JPN UNE_LF_M    PT_LF_SUB             _Z          Y  _T
     AGE ACTIVITY FREQ obsValue UNIT_MULT DECIMALS OBS_STATUS
1 Y_GE15       _Z    M      2.6         0        1          A
2 Y_GE15       _Z    M      2.5         0        1          A
3 Y_GE15       _Z    M      2.5         0        1          A
4 Y_GE15       _Z    M      2.6         0        1          A
5 Y_GE15       _Z    M      2.6         0        1          A
6 Y_GE15       _Z    M      2.6         0        1          A</code></pre>
</div>
</div>
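<p>If you would rather assemble the URL in code than copy it from the Data Explorer, a minimal sketch follows. It assumes the URL layout seen in the link above (base address, dataset identifier, filter expression, optional <code>startPeriod</code>); <code>build_oecd_url()</code> is a hypothetical helper, not an <code>rsdmx</code> function.</p>

```r
# Hypothetical helper: assemble an OECD SDMX data URL from its parts.
build_oecd_url <- function(dataset, filter, start_period = NULL) {
  base <- "https://sdmx.oecd.org/public/rest/data"
  url <- paste(base, dataset, filter, sep = "/")
  # Append the time restriction only when one is requested
  if (!is.null(start_period)) {
    url <- paste0(url, "?startPeriod=", start_period)
  }
  url
}

# Two-country variant of the filter used above (assumed for illustration)
oecd_url <- build_oecd_url(
  dataset = "OECD.SDD.TPS,DSD_LFS@DF_IALFS_UNE_M,1.0",
  filter = "USA+TUR..PT_LF_SUB._Z.Y._T.Y_GE15..M",
  start_period = "2023-11"
)
print(oecd_url)
```

<p>The string in <code>oecd_url</code> can be handed to <code>readSDMX()</code> in the same way as the copied link.</p>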
</section>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Both <code>OECD</code> and <code>rsdmx</code> allow you to specify filters directly in your API request, which is critical for:</p>
<ol type="1">
<li><p><strong>Time Efficiency</strong>: Smaller, focused datasets download faster.</p></li>
<li><p><strong>Storage Optimization</strong>: Filtering minimizes the size of the retrieved dataset.</p></li>
<li><p><strong>Simpler Analysis</strong>: Pre-filtered data reduces the need for extensive preprocessing.</p></li>
</ol>
<p>When working with OECD databases in R, the updated version of the <code>OECD</code> package is a reliable choice, provided you install it from its GitHub repository rather than from CRAN. If you prefer working directly with API URLs, the <code>rsdmx</code> package is another strong option.</p>
<p>Regardless of the package, applying filters in your data requests is essential to ensure efficiency and reproducibility. By integrating these tools into your workflow, you can access OECD data programmatically and focus on the analysis itself.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ol type="1">
<li><p><a href="https://data-explorer.oecd.org/"><strong>OECD Data Explorer</strong></a></p></li>
<li><p><a href="https://www.oecd.org/en/data/insights/data-explainers/2024/09/api.html"><strong>OECD Data via API</strong></a></p></li>
<li><p><a href="https://github.com/expersso/OECD"><strong>Updated <code>OECD</code> Package on GitHub</strong></a></p></li>
<li><p><a href="https://cran.r-project.org/web/packages/rsdmx/index.html"><strong><code>rsdmx</code> Package Documentation</strong></a></p></li>
<li><p><a href="https://gitlab.algobank.oecd.org/public-documentation/dotstat-migration/-/raw/main/OECD_Data_API_documentation.pdf">OECD Data API documentation</a></p></li>
<li><p><a href="https://gitlab.algobank.oecd.org/public-documentation/dotstat-migration/-/raw/main/OECD_Data_API_documentation-Upgrading_from_the_legacy_OECD.Stat_APIs.pdf">Upgrading your queries from the legacy OECD.Stat APIs to the new OECD Data API</a></p></li>
</ol>


</section>

 ]]></description>
  <category>R Programming</category>
  <category>OECD</category>
  <category>rsdmx</category>
  <category>Import</category>
  <category>API</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-12-16_oecd/</guid>
  <pubDate>Mon, 16 Dec 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Creating Professional Excel Reports with R: A Comprehensive Guide to openxlsx Package</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-11-04_openxlsx/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://ycphs.github.io/openxlsx/"><img src="https://mfatihtuzen.github.io/posts/2024-11-04_openxlsx/openxlsx.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>The ability to generate professional Excel reports programmatically is a crucial skill in data analysis and business reporting. In this comprehensive guide, we’ll explore how to use the <code>openxlsx</code> package in R to create sophisticated Excel reports with multiple sheets, custom formatting, and visualizations. This tutorial is designed for beginners to intermediate R users who want to automate their reporting workflows.</p>
</section>
<section id="why-choose-openxlsx" class="level2">
<h2 class="anchored" data-anchor-id="why-choose-openxlsx">Why Choose openxlsx?</h2>
<ul>
<li><p><strong>No Excel or Java Dependency</strong>: openxlsx doesn’t require an Excel installation, and unlike XLConnect it has no Java dependency</p></li>
<li><p><strong>Performance</strong>: Efficient handling of large datasets</p></li>
<li><p><strong>Comprehensive Formatting</strong>: Extensive options for cell styling, merging, and formatting</p></li>
<li><p><strong>Multiple Worksheets</strong>: Easy management of multiple sheets in a workbook</p></li>
<li><p><strong>Custom Styles</strong>: Ability to create and apply custom styles</p></li>
<li><p><strong>Memory Efficient</strong>: Better memory management compared to other packages</p></li>
<li><p><strong>Active Development</strong>: Regular updates and community support</p></li>
</ul>
</section>
<section id="getting-started" class="level2">
<h2 class="anchored" data-anchor-id="getting-started">Getting Started</h2>
<p>First, load the required packages (install them beforehand with <code>install.packages()</code> if they are missing):</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load packages</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(openxlsx)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span></code></pre></div></div>
</div>
</section>
<section id="basic-functions-and-their-arguments" class="level2">
<h2 class="anchored" data-anchor-id="basic-functions-and-their-arguments">Basic Functions and Their Arguments</h2>
<section id="core-functions" class="level3">
<h3 class="anchored" data-anchor-id="core-functions">Core Functions</h3>
<p><strong><code>createWorkbook()</code></strong></p>
<p>The <code>createWorkbook()</code> function is the starting point. Running <code>wb &lt;- createWorkbook()</code> creates a new, empty workbook object and assigns it to the variable <code>wb</code>. This workbook serves as the container for any worksheets, styles, and data you want to add before saving it as an Excel file.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">wb <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">createWorkbook</span>()</span></code></pre></div></div>
</div>
<p><strong><code>addWorksheet()</code></strong></p>
<p>The <strong><code>addWorksheet()</code></strong> function, part of the <strong>openxlsx</strong> package in R, is used to add a new worksheet (tab) to an Excel workbook created with <code>createWorkbook()</code>.</p>
<p>Key arguments:</p>
<ul>
<li><p><strong><code>wb</code></strong>: This is the workbook object to which you’re adding a new worksheet. It should be an existing workbook created with <code>createWorkbook()</code>.</p></li>
<li><p><strong><code>sheetName = "Sales Report"</code></strong>: This argument specifies the name of the new worksheet. In this case, the sheet will be labeled “Sales Report.” The name you choose will appear as the worksheet tab name in the Excel file.</p></li>
<li><p><strong><code>gridLines = TRUE</code></strong>: This argument controls whether gridlines are visible in the worksheet.</p>
<ul>
<li><p><strong><code>TRUE</code></strong>: Shows gridlines (default setting).</p></li>
<li><p><strong><code>FALSE</code></strong>: Hides gridlines, which can create a cleaner look in some reports.</p></li>
</ul></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addWorksheet</span>(wb, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sheetName =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sales Report"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">gridLines =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span></code></pre></div></div>
</div>
<p><strong><code>writeData()</code></strong></p>
<p>The <code>writeData()</code> function from the <strong>openxlsx</strong> package in R is used to add data to a specific worksheet in an Excel workbook. Here’s what each argument in the call below does:</p>
<ul>
<li><p><strong><code>wb</code></strong>: This is the workbook object where you want to write data. The workbook should already be created using <code>createWorkbook()</code>.</p></li>
<li><p><strong><code>sheet = 1</code></strong>: This specifies the sheet to which you’re writing data. Here, <code>1</code> refers to the first sheet in the workbook. You can also use the sheet’s name (e.g., <code>sheet = "Sales Report"</code>) if you prefer.</p></li>
<li><p><strong><code>x = data</code></strong>: This is the data you want to write to the worksheet. <code>data</code> can be a data frame, matrix, or vector.</p></li>
<li><p><strong><code>startRow = 1</code></strong>: This specifies the row in the worksheet where the data should start. In this case, data will be written beginning at the first row.</p></li>
<li><p><strong><code>startCol = 1</code></strong>: This specifies the column where the data should start. Setting this to <code>1</code> will write data starting from the first column (column “A” in Excel).</p></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">writeData</span>(wb, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sheet =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startRow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startCol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span></code></pre></div></div>
</div>
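<p>As a small sketch of the arguments above, the call below refers to the sheet by name instead of by index and starts writing at cell B3. The objects <code>wb_demo</code> and <code>demo_data</code> are made up for illustration and are not part of the report built later. Reading the sheet back from the in-memory workbook is one way to check what was written.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx)

# Hypothetical demo data, not part of the report built below
demo_data &lt;- data.frame(Item = c("A", "B"), Qty = c(10, 20))

wb_demo &lt;- createWorkbook()
addWorksheet(wb_demo, sheetName = "Sales Report")

# Refer to the sheet by name and start at row 3, column 2 (cell B3)
writeData(wb_demo, sheet = "Sales Report", x = demo_data, startRow = 3, startCol = 2)

# read.xlsx() also accepts a Workbook object, so no file is needed here
read_back &lt;- read.xlsx(wb_demo, sheet = "Sales Report")
read_back</code></pre></div>
</div>
<p>With default settings, <code>read.xlsx()</code> skips the empty leading rows and columns, so <code>read_back</code> should contain just the <code>Item</code> and <code>Qty</code> columns.</p>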
</section>
</section>
<section id="step-by-step-report-creation" class="level2">
<h2 class="anchored" data-anchor-id="step-by-step-report-creation">Step-by-Step Report Creation</h2>
<p>Let’s create a sample sales report with multiple sheets, formatting, and charts.</p>
<section id="step-1-prepare-sample-data" class="level3">
<h3 class="anchored" data-anchor-id="step-1-prepare-sample-data">Step 1: Prepare Sample Data</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create sample sales data</span></span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb5-3">sales_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb5-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq.Date</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.Date</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-01"</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.Date</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-12-31"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"month"</span>),</span>
<span id="cb5-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Region =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"North"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"South"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"East"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"West"</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb5-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Sales =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50000</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb5-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Units =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>)),</span>
<span id="cb5-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Profit =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">25000</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb5-9">)</span>
<span id="cb5-10"></span>
<span id="cb5-11">sales_data</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>         Date Region    Sales Units   Profit
1  2023-01-01  North 21503.10   371 18114.12
2  2023-02-01  South 41532.21   329 19170.61
3  2023-03-01   East 26359.08   141 15881.32
4  2023-04-01   West 45320.70   460 16882.84
5  2023-05-01  North 47618.69   198 10783.19
6  2023-06-01  South 11822.26   117  7942.27
7  2023-07-01   East 31124.22   231 24260.48
8  2023-08-01   West 45696.76   482 23045.98
9  2023-09-01  North 32057.40   456 18814.11
10 2023-10-01  South 28264.59   377 20909.35
11 2023-11-01   East 48273.33   356  5492.27
12 2023-12-01   West 28133.37   498 14555.92</code></pre>
</div>
</div>
<ul>
<li><p><strong><code>set.seed(123)</code></strong>: This sets the random seed to ensure that any randomly generated numbers in the code are reproducible. This is useful if you want to get the same “random” values each time you run the code.</p></li>
<li><p><strong><code>sales_data &lt;- data.frame(...)</code></strong>: This creates a data frame called <code>sales_data</code> to store the sample sales data. A data frame is a table-like structure in R, suitable for storing datasets.</p></li>
<li><p><strong><code>Date = seq.Date(...)</code></strong>: <code>seq.Date()</code> generates a monthly sequence of dates that starts on January 1, 2023, and stops at or before December 31, 2023. The result is the first day of each month, so the last value is December 1, 2023.</p>
<ul>
<li><p><code>as.Date("2023-01-01")</code> and <code>as.Date("2023-12-31")</code> define the start and end dates for the sequence.</p></li>
<li><p><code>by = "month"</code> specifies that the sequence should increment by one month at a time, creating 12 monthly date entries.</p></li>
</ul></li>
<li><p><strong><code>Region = rep(c("North", "South", "East", "West"), 3)</code></strong>: <code>rep(c("North", "South", "East", "West"), 3)</code> repeats the four regions (“North”, “South”, “East”, “West”) three times to get a total of 12 values. This column will indicate which region each data entry corresponds to.</p></li>
<li><p><strong><code>Sales = round(runif(12, 10000, 50000), 2)</code></strong>:</p>
<ul>
<li><p><code>runif(12, 10000, 50000)</code> generates 12 random numbers between 10,000 and 50,000, representing the monthly sales figures.</p></li>
<li><p><code>round(..., 2)</code> rounds these sales figures to two decimal places for readability.</p></li>
</ul></li>
<li><p><strong><code>Units = round(runif(12, 100, 500))</code></strong>:</p>
<ul>
<li><p><code>runif(12, 100, 500)</code> generates 12 random real numbers (not integers) drawn uniformly between 100 and 500, standing in for the number of units sold each month.</p></li>
<li><p><code>round()</code> rounds these values to the nearest whole number.</p></li>
</ul></li>
<li><p><strong><code>Profit = round(runif(12, 5000, 25000), 2)</code></strong>:</p>
<ul>
<li><p><code>runif(12, 5000, 25000)</code> generates 12 random numbers between 5,000 and 25,000, representing monthly profit values.</p></li>
<li><p><code>round(..., 2)</code> rounds each profit value to two decimal places.</p></li>
</ul></li>
</ul>
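<p>The pieces described above can be checked in isolation with base R. This is only a quick sketch; the names <code>dates</code>, <code>raw_units</code>, and <code>units</code> are introduced here for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(123)

# Monthly sequence: stops at the last month start on or before the end date
dates &lt;- seq.Date(as.Date("2023-01-01"), as.Date("2023-12-31"), by = "month")
length(dates)  # 12 dates, one per month
range(dates)   # first and last month start

# runif() returns real numbers; round() turns them into whole unit counts
raw_units &lt;- runif(12, 100, 500)
units &lt;- round(raw_units)
all(units == floor(units))  # TRUE: only whole values remain after rounding
class(units)                # "numeric": whole values, still stored as doubles</code></pre></div>
</div>
<p>If a true integer column is needed, wrapping the result in <code>as.integer()</code> is one option.</p>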
</section>
<section id="step-2-create-workbook-and-add-sheets" class="level3">
<h3 class="anchored" data-anchor-id="step-2-create-workbook-and-add-sheets">Step 2: Create Workbook and Add Sheets</h3>
<p>The following code creates an Excel workbook and prepares it with several worksheets and customized styles for titles and headers. Let’s walk through each part.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create new workbook</span></span>
<span id="cb7-2">wb <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">createWorkbook</span>()</span></code></pre></div></div>
</div>
<p>This line initializes a new workbook object (<code>wb</code>) where you’ll add worksheets and data. The workbook is created using <code>createWorkbook()</code> from the <strong>openxlsx</strong> package.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add worksheets</span></span>
<span id="cb8-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addWorksheet</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>)</span>
<span id="cb8-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addWorksheet</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Details"</span>)</span>
<span id="cb8-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addWorksheet</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Charts"</span>)</span></code></pre></div></div>
</div>
<p>These lines add three worksheets to the workbook, named “Summary,” “Details,” and “Charts.” Each worksheet will be a separate tab in the Excel file.</p>
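<p>To confirm which tabs a workbook holds, you can ask for its sheet names. The sketch below uses a separate workbook, <code>wb_check</code> (a name chosen here for illustration), so it does not interfere with <code>wb</code>.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx)

wb_check &lt;- createWorkbook()
for (sheet_name in c("Summary", "Details", "Charts")) {
  addWorksheet(wb_check, sheet_name)
}

# names() on a Workbook returns the worksheet names in tab order
names(wb_check)</code></pre></div>
</div>
<p>Calling <code>addWorksheet()</code> again with a name that already exists raises an error, so a check like this can help when re-running chunks interactively.</p>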
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a title style</span></span>
<span id="cb9-2">title_style <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">createStyle</span>(</span>
<span id="cb9-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fontSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span>,</span>
<span id="cb9-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fontColour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#FFFFFF"</span>,</span>
<span id="cb9-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">halign =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"center"</span>,</span>
<span id="cb9-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fgFill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#4F81BD"</span>,</span>
<span id="cb9-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">textDecoration =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>,</span>
<span id="cb9-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">border =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"TopBottom"</span>,</span>
<span id="cb9-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">borderColour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#4F81BD"</span></span>
<span id="cb9-10">)</span></code></pre></div></div>
</div>
<ul>
<li><strong><code>createStyle()</code></strong>: This function defines a custom style that you can apply to specific cells in the workbook. The style here is designed for titles and is stored in <code>title_style</code>.</li>
</ul>
<section id="arguments-in-createstyle-for-the-title" class="level4">
<h4 class="anchored" data-anchor-id="arguments-in-createstyle-for-the-title">Arguments in <code>createStyle()</code> for the Title:</h4>
<ul>
<li><p><strong><code>fontSize = 14</code></strong>: Sets the font size to 14 for better visibility of the title.</p></li>
<li><p><strong><code>fontColour = "#FFFFFF"</code></strong>: Sets the font color to white, using a hexadecimal color code.</p></li>
<li><p><strong><code>halign = "center"</code></strong>: Horizontally aligns the text to the center within the cell.</p></li>
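<li><p><strong><code>fgFill = "#4F81BD"</code></strong>: Sets the cell’s fill color to a shade of blue (<code>#4F81BD</code>). In <strong>openxlsx</strong>, <code>fgFill</code> is the foreground color of the fill, which is what you see as the cell background for a solid fill.</p></li>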
<li><p><strong><code>textDecoration = "bold"</code></strong>: Makes the text bold to emphasize it as a title.</p></li>
<li><p><strong><code>border = "TopBottom"</code></strong>: Adds borders to the top and bottom of the cell to give the title a framed appearance.</p></li>
<li><p><strong><code>borderColour = "#4F81BD"</code></strong>: Sets the color of the borders to match the blue fill color.</p></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create header style</span></span>
<span id="cb10-2">header_style <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">createStyle</span>(</span>
<span id="cb10-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fontSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>,</span>
<span id="cb10-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fontColour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#000000"</span>,</span>
<span id="cb10-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">halign =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"center"</span>,</span>
<span id="cb10-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fgFill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#DCE6F1"</span>,</span>
<span id="cb10-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">textDecoration =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>,</span>
<span id="cb10-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">border =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bottom"</span>,</span>
<span id="cb10-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">borderColour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#4F81BD"</span></span>
<span id="cb10-10">)</span></code></pre></div></div>
</div>
<ul>
<li>This style is designed for headers in the worksheets, stored in <code>header_style</code>.</li>
</ul>
</section>
<section id="arguments-in-createstyle-for-the-header" class="level4">
<h4 class="anchored" data-anchor-id="arguments-in-createstyle-for-the-header">Arguments in <code>createStyle()</code> for the Header:</h4>
<ul>
<li><p><strong><code>fontSize = 12</code></strong>: Sets a slightly smaller font size than the title.</p></li>
<li><p><strong><code>fontColour = "#000000"</code></strong>: Sets the font color to black.</p></li>
<li><p><strong><code>halign = "center"</code></strong>: Centers the text within each cell.</p></li>
<li><p><strong><code>fgFill = "#DCE6F1"</code></strong>: Sets a light blue background fill for the header cells to distinguish them visually.</p></li>
<li><p><strong><code>textDecoration = "bold"</code></strong>: Makes the header text bold.</p></li>
<li><p><strong><code>border = "bottom"</code></strong>: Adds a border to the bottom of the cell.</p></li>
<li><p><strong><code>borderColour = "#4F81BD"</code></strong>: Sets the color of the bottom border to the same blue as in the title style.</p></li>
</ul>
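<p>The title and header styles share several settings (centered, bold, same border color). One possible way to avoid repeating them is a small helper function. The helper <code>make_report_style()</code> and the <code>_alt</code> object names below are hypothetical and only sketch the idea; they are not part of <strong>openxlsx</strong>.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx)

# Hypothetical helper: the shared settings live in one place
make_report_style &lt;- function(font_size, font_colour, fill, border) {
  createStyle(
    fontSize = font_size,
    fontColour = font_colour,
    halign = "center",
    fgFill = fill,
    textDecoration = "bold",
    border = border,
    borderColour = "#4F81BD"
  )
}

# Same settings as title_style and header_style above
title_style_alt &lt;- make_report_style(14, "#FFFFFF", "#4F81BD", "TopBottom")
header_style_alt &lt;- make_report_style(12, "#000000", "#DCE6F1", "bottom")</code></pre></div>
</div>
<p>For two styles the duplication is harmless, so this is mostly worth considering once a report has many styles that differ in only a few arguments.</p>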
</section>
</section>
<section id="step-3-add-summary-data-and-formatting" class="level3">
<h3 class="anchored" data-anchor-id="step-3-add-summary-data-and-formatting">Step 3: Add Summary Data and Formatting</h3>
<p>This code adds a formatted title and data summary to the “Summary” worksheet in an Excel workbook, then applies styling to headers and numeric data, and adjusts column widths for a polished appearance. Let’s go through each section.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Write title</span></span>
<span id="cb11-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">writeData</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sales Performance Report 2023"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startCol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startRow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb11-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mergeCells</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb11-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addStyle</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>, title_style, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>writeData(wb, "Summary", "Sales Performance Report 2023", startCol = 1, startRow = 1)</code></strong>: This places the text <code>"Sales Performance Report 2023"</code> in cell A1 of the “Summary” worksheet.</p></li>
<li><p><strong><code>mergeCells(wb, "Summary", cols = 1:5, rows = 1)</code></strong>: Merges columns 1 to 5 (A to E) in the first row into a single cell so the title spans the full width of the table. The centering itself comes from <code>halign = "center"</code> in <code>title_style</code>, not from the merge.</p></li>
<li><p><strong><code>addStyle(wb, "Summary", title_style, rows = 1, cols = 1:5)</code></strong>: Applies the previously defined <code>title_style</code> to the merged title cell. This style includes formatting like font size, color, alignment, and borders, giving the title a professional appearance.</p></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Write data with headers</span></span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">writeData</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>, sales_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startCol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startRow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb12-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addStyle</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>, header_style, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>writeData(wb, "Summary", sales_data, startCol = 1, startRow = 3)</code></strong>: Writes the <code>sales_data</code> data frame starting from cell A3. Row 3 will contain the headers from <code>sales_data</code>, while the rows below will contain the data.</p></li>
<li><p><strong><code>addStyle(wb, "Summary", header_style, rows = 3, cols = 1:5)</code></strong>: Applies the <code>header_style</code> to row 3 (columns A to E) to make the headers bold, centered, and colored with a background fill. This improves readability and distinguishes the headers from the data.</p></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Format numbers</span></span>
<span id="cb13-2">number_style <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">createStyle</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">numFmt =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#,##0.00"</span>)</span>
<span id="cb13-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addStyle</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>, number_style, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">gridExpand =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>number_style &lt;- createStyle(numFmt = "#,##0.00")</code></strong>: Defines a style named <code>number_style</code> that formats numbers with commas as thousands separators and two decimal places (e.g., <code>12,345.67</code>).</p></li>
<li><p><strong><code>addStyle(wb, "Summary", number_style, rows = 4:15, cols = 3:5, gridExpand = TRUE)</code></strong>:</p>
<ul>
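<li><p>Applies this <code>number_style</code> to columns 3 through 5 (Sales, Units, and Profit in <code>sales_data</code>) for rows 4 to 15, which covers all 12 data rows. Because Units falls inside this range, its whole-number values are also displayed with two decimals (for example, <code>371.00</code>).</p></li>
<li><p><strong><code>gridExpand = TRUE</code></strong> applies the style to every combination of the given rows and columns, that is, the full 12-by-3 block. Without it, <code>rows</code> and <code>cols</code> are treated as paired cell coordinates and must have the same length.</p></li>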
</ul></li>
</ul>
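<p>One way to avoid hard-coding a row range such as <code>4:15</code>, and to keep whole-number columns free of decimals, is to derive the rows from the data and use a separate format for counts. The sketch below uses a two-row stand-in data frame, <code>demo_sales</code>, with the same column layout as <code>sales_data</code>; the object names and the <code>"#,##0"</code> format for Units are choices made here for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx)

# Stand-in data with the same columns as sales_data (first two rows only)
demo_sales &lt;- data.frame(
  Date = as.Date(c("2023-01-01", "2023-02-01")),
  Region = c("North", "South"),
  Sales = c(21503.10, 41532.21),
  Units = c(371, 329),
  Profit = c(18114.12, 19170.61)
)

wb_fmt &lt;- createWorkbook()
addWorksheet(wb_fmt, "Summary")

start_row &lt;- 3
writeData(wb_fmt, "Summary", demo_sales, startCol = 1, startRow = start_row)

# The header sits on start_row, so the data fills the next nrow() rows
data_rows &lt;- (start_row + 1):(start_row + nrow(demo_sales))

money_style &lt;- createStyle(numFmt = "#,##0.00")  # two decimals
count_style &lt;- createStyle(numFmt = "#,##0")     # whole numbers

# Look up column positions by name instead of hard-coding them
money_cols &lt;- match(c("Sales", "Profit"), names(demo_sales))
units_col &lt;- match("Units", names(demo_sales))

addStyle(wb_fmt, "Summary", money_style, rows = data_rows, cols = money_cols, gridExpand = TRUE)
addStyle(wb_fmt, "Summary", count_style, rows = data_rows, cols = units_col, gridExpand = TRUE)</code></pre></div>
</div>
<p>With the full 12-row <code>sales_data</code>, the same expression for <code>data_rows</code> evaluates to <code>4:15</code>, matching the range used above.</p>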
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Adjust column widths</span></span>
<span id="cb14-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">setColWidths</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Summary"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">widths =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"auto"</span>)</span></code></pre></div></div>
</div>
<p><strong><code>setColWidths(wb, "Summary", cols = 1:5, widths = "auto")</code></strong>: Asks <strong>openxlsx</strong> to size columns 1 through 5 (A to E) based on their content, so data and headers are usually readable without manual adjustment. The automatic width is an estimate, so it is worth checking the result in Excel.</p>
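<p>If the automatic widths do not look right, explicit widths are an alternative. The sketch below sets one width per column and saves the workbook to a temporary file so it can be opened and inspected. The workbook <code>wb_width</code>, the one-row data frame, and the width values are illustrative assumptions.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx)

wb_width &lt;- createWorkbook()
addWorksheet(wb_width, "Summary")
writeData(wb_width, "Summary", data.frame(Region = "North", Sales = 21503.10))

# Explicit widths, one value per column (illustrative values)
setColWidths(wb_width, "Summary", cols = 1:2, widths = c(12, 14))

# Save to a temporary file so the result can be opened and checked
out_file &lt;- tempfile(fileext = ".xlsx")
saveWorkbook(wb_width, out_file, overwrite = TRUE)
file.exists(out_file)</code></pre></div>
</div>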
</section>
<section id="step-4-create-and-add-visualizations" class="level3">
<h3 class="anchored" data-anchor-id="step-4-create-and-add-visualizations">Step 4: Create and Add Visualizations</h3>
<p>This code creates a line chart to visualize monthly sales trends and inserts it into an Excel workbook. Here’s a step-by-step explanation of each part.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create monthly sales trend chart</span></span>
<span id="cb15-2">sales_plot <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(sales_data, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> Date, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> Sales)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_line</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#4F81BD"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">linewidth =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_point</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#4F81BD"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Monthly Sales Trend"</span>,</span>
<span id="cb15-7">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Month"</span>,</span>
<span id="cb15-8">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sales ($)"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb15-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hjust =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">face =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bold"</span>))</span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save plot to a temporary image file</span></span>
<span id="cb17-2">img_file <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tempfile</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fileext =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".png"</span>)</span>
<span id="cb17-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggsave</span>(</span>
<span id="cb17-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">filename =</span> img_file,</span>
<span id="cb17-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot =</span> sales_plot,</span>
<span id="cb17-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">width =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>,</span>
<span id="cb17-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">height =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,</span>
<span id="cb17-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">units =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"in"</span>,</span>
<span id="cb17-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dpi =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span></span>
<span id="cb17-10">)</span>
<span id="cb17-11"></span>
<span id="cb17-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">file.exists</span>(img_file)) {</span>
<span id="cb17-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stop</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Plot image file was not created:"</span>, img_file))</span>
<span id="cb17-14">}</span>
<span id="cb17-15"></span>
<span id="cb17-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Insert saved image into workbook</span></span>
<span id="cb17-17">openxlsx<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">insertImage</span>(</span>
<span id="cb17-18">  wb,</span>
<span id="cb17-19">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sheet =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Charts"</span>,</span>
<span id="cb17-20">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">file =</span> img_file,</span>
<span id="cb17-21">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startCol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb17-22">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startRow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb17-23">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">width =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>,</span>
<span id="cb17-24">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">height =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>,</span>
<span id="cb17-25">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">units =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"in"</span></span>
<span id="cb17-26">)</span></code></pre></div></div>
</div>
<p>The <strong><code>ggsave()</code></strong> and <strong><code>insertImage()</code></strong> functions are used together to export a plot and place it into an Excel worksheet.</p>
<ul>
<li><p><strong><code>ggsave()</code></strong>: Saves the <code>sales_plot</code> object as an image file (in this case, a temporary PNG file). This ensures that the plot is explicitly created and available for further use, especially in non-interactive environments.</p></li>
<li><p><strong><code>img_file &lt;- tempfile(fileext = ".png")</code></strong>: Creates a temporary file path where the plot image will be stored.</p></li>
<li><p><strong><code>file.exists(img_file)</code></strong>: Checks whether the image file has been successfully created before attempting to insert it into the workbook.</p></li>
<li><p><strong><code>insertImage()</code></strong>: An <strong>openxlsx</strong> function used to insert an external image file into an Excel worksheet.</p>
<ul>
<li><strong><code>wb</code></strong>: Specifies the workbook to insert the image into.</li>
<li><strong><code>sheet = "Charts"</code></strong>: Specifies the worksheet where the image will be placed.</li>
<li><strong><code>file = img_file</code></strong>: Provides the path to the saved plot image.</li>
<li><strong><code>startCol = 1, startRow = 1</code></strong>: Inserts the image starting at cell A1 of the “Charts” worksheet.</li>
<li><strong><code>width = 8, height = 6</code></strong>: Sets the width and height of the image in inches.</li>
</ul></li>
</ul>
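<p>The export-and-insert pattern described above can be wrapped in a small helper. The sketch below is illustrative only: <code>add_plot_to_sheet()</code>, <code>demo_data</code>, <code>demo_plot</code>, and <code>demo_wb</code> are hypothetical names that are not part of the report code, and the helper simply combines <code>ggsave()</code> and <code>insertImage()</code> with the same arguments used in this step.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx)
library(ggplot2)

# Hypothetical helper: save a ggplot as a temporary PNG and place it at A1
add_plot_to_sheet &lt;- function(wb, sheet, plot, width = 8, height = 6) {
  img_file &lt;- tempfile(fileext = ".png")
  ggsave(filename = img_file, plot = plot,
         width = width, height = height, units = "in", dpi = 300)

  # Fail early if the image was not written
  if (!file.exists(img_file)) {
    stop(paste("Plot image file was not created:", img_file))
  }

  insertImage(wb, sheet = sheet, file = img_file,
              startCol = 1, startRow = 1,
              width = width, height = height, units = "in")

  # Return the image path so the caller can inspect or clean it up later
  invisible(img_file)
}

# Toy data, used only to make the sketch self-contained
demo_data &lt;- data.frame(
  Date  = as.Date(c("2023-01-01", "2023-02-01", "2023-03-01")),
  Sales = c(100, 120, 90)
)
demo_plot &lt;- ggplot(demo_data, aes(x = Date, y = Sales)) +
  geom_line(linewidth = 1.2)

demo_wb &lt;- createWorkbook()
addWorksheet(demo_wb, "Charts")
img_path &lt;- add_plot_to_sheet(demo_wb, "Charts", demo_plot)</code></pre></div>
</div>
<p>Returning the path rather than deleting the file inside the helper is a cautious choice here: it leaves the image in place until the workbook has been saved.</p>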
</section>
<section id="step-5-add-regional-analysis" class="level3">
<h3 class="anchored" data-anchor-id="step-5-add-regional-analysis">Step 5: Add Regional Analysis</h3>
<p>Next, let’s create a summary of sales data by region, write it to the “Details” worksheet of the workbook, and apply styling for a professional presentation.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create regional summary</span></span>
<span id="cb18-2">regional_summary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> sales_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(Region) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb18-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb18-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Total_Sales =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(Sales),</span>
<span id="cb18-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Avg_Units =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(Units),</span>
<span id="cb18-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Total_Profit =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(Profit)</span>
<span id="cb18-8">  )</span>
<span id="cb18-9"></span>
<span id="cb18-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Write regional summary to Details sheet</span></span>
<span id="cb18-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">writeData</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Details"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Regional Performance Summary"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startCol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startRow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb18-12"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mergeCells</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Details"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb18-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addStyle</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Details"</span>, title_style, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb18-14"></span>
<span id="cb18-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">writeData</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Details"</span>, regional_summary, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startCol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">startRow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb18-16"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addStyle</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Details"</span>, header_style, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span></code></pre></div></div>
</div>
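<p>To see what the grouped summary produces, here is a minimal sketch on made-up data. The <code>toy_sales</code> data frame and its values are assumptions for illustration only; the column names mirror those used in the step above.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(dplyr)

# Made-up data with the same column names as sales_data
toy_sales &lt;- data.frame(
  Region = c("North", "North", "South", "South"),
  Sales  = c(100, 150, 200, 50),
  Units  = c(10, 20, 30, 40),
  Profit = c(20, 30, 40, 10)
)

# One row per region: total sales, average units, total profit
toy_summary &lt;- toy_sales %&gt;%
  group_by(Region) %&gt;%
  summarise(
    Total_Sales  = sum(Sales),
    Avg_Units    = mean(Units),
    Total_Profit = sum(Profit)
  )

toy_summary
ncol(toy_summary)  # 4 columns: Region plus three summary measures</code></pre></div>
</div>
<p>The result has four columns, which is why the title merge and the header style in this step both target <code>cols = 1:4</code>.</p>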
</section>
<section id="step-6-save-the-workbook" class="level3">
<h3 class="anchored" data-anchor-id="step-6-save-the-workbook">Step 6: Save the Workbook</h3>
<p>Lastly, this command finalizes and exports the workbook, preserving all worksheets, data, formatting, and charts created in the previous steps. You should see a file named <code>Sales_Report_2023.xlsx</code> in your working directory after this line runs.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb19-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save the workbook</span></span>
<span id="cb19-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">saveWorkbook</span>(wb, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sales_Report_2023.xlsx"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">overwrite =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span></code></pre></div></div>
</div>
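<p>One way to confirm that a workbook was written as expected is to read it back. The following is a minimal sketch that uses a throwaway workbook in a temporary file rather than the report itself; <code>check_wb</code>, <code>check_file</code>, and the toy data are hypothetical names introduced for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx)

# Build and save a small throwaway workbook
check_wb &lt;- createWorkbook()
addWorksheet(check_wb, "Summary")
addWorksheet(check_wb, "Details")
writeData(check_wb, "Summary",
          data.frame(Month = c("Jan", "Feb"), Sales = c(100, 120)))

check_file &lt;- tempfile(fileext = ".xlsx")
saveWorkbook(check_wb, check_file, overwrite = TRUE)

# Read the file back: list the sheets and load one of them
saved_sheets &lt;- getSheetNames(check_file)
saved_data   &lt;- read.xlsx(check_file, sheet = "Summary")

saved_sheets
saved_data</code></pre></div>
</div>
<p>The same two calls can be pointed at <code>Sales_Report_2023.xlsx</code> to check its sheets without opening Excel.</p>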
<p>After saving the Excel file with the <code>Summary</code>, <code>Details</code>, and <code>Charts</code> sheets, I opened the file to review the output. Below, I’m sharing screenshots of each sheet to showcase the final report layout, formatting, and visualization.</p>
<p>In the <strong>Summary</strong> sheet, you can see the main title, followed by a detailed table with the monthly sales data. The headers and values are formatted to improve readability and create a professional appearance.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-11-04_openxlsx/Summary.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>The <strong>Details</strong> sheet provides a regional breakdown with aggregated sales, average units, and profit for each region. This sheet includes formatted headers and a clear, centered title, making it easy to interpret the regional performance metrics.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-11-04_openxlsx/Details.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>Lastly, the <strong>Charts</strong> sheet contains a line graph displaying the monthly sales trend. This visualization is useful for spotting sales patterns and seeing how performance changes over the months.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-11-04_openxlsx/Charts.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>These screenshots illustrate the powerful formatting and customization options available when generating Excel reports in R, making it straightforward to create polished and informative workbooks for reporting.</p>
</section>
</section>
<section id="best-practices-and-tips-for-using-the-openxlsx-package-in-r" class="level2">
<h2 class="anchored" data-anchor-id="best-practices-and-tips-for-using-the-openxlsx-package-in-r">Best Practices and Tips for Using the <code>openxlsx</code> Package in R</h2>
<ol type="1">
<li><p><strong>Use Meaningful Sheet Names</strong><br>
Choose descriptive and relevant names for your Excel sheets. This helps users understand the content at a glance and enhances navigation within the workbook. For example, instead of generic names like “Sheet1,” use names like “SalesData_Q1” or “CustomerFeedback.”</p></li>
<li><p><strong>Implement Consistent Styling Across Sheets</strong><br>
Maintain a uniform style throughout your workbook to enhance readability and professionalism. Use consistent fonts, colors, and cell styles. You can set styles using the <code>createStyle()</code> function and apply them to multiple sheets to ensure uniformity.</p></li>
<li><p><strong>Include Proper Documentation in Your Code</strong><br>
Document your R code with clear comments explaining the purpose of each section and any specific styling or formatting choices made with the <code>openxlsx</code> functions. This will make your code easier to understand and maintain, especially for others who may work with it later.</p></li>
<li><p><strong>Use Appropriate Number Formatting for Different Data Types</strong><br>
Apply relevant number formats for various data types, such as currency, percentages, or dates. Define the format with the <code>numFmt</code> argument of <code>createStyle()</code> and apply it to the target cells with <code>addStyle()</code>, which improves data clarity and presentation in your reports.</p></li>
<li><p><strong>Test the Report with Different Data Sizes</strong><br>
Before finalizing your report, test it with datasets of varying sizes to ensure it renders correctly and performs well. This will help you identify any potential issues, such as layout problems or performance slowdowns, before distribution.</p></li>
<li><p><strong>Include Error Handling for Robust Reports</strong><br>
Implement error handling in your R code to gracefully manage potential issues, such as missing data or formatting errors. Use <code>tryCatch()</code> to catch errors during report generation, ensuring that your report generation process is robust and user-friendly.</p></li>
</ol>
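<p>Two of the tips above, number formatting and error handling, can be sketched together. This is a minimal illustration under assumptions: the data, the format strings, and the names <code>report_wb</code>, <code>toy</code>, and <code>report_file</code> are made up, and the workbook is written to a temporary file.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx)

report_wb &lt;- createWorkbook()
addWorksheet(report_wb, "Summary")

# Made-up data: a sales amount and a margin expressed as a fraction
toy &lt;- data.frame(
  Region = c("North", "South"),
  Sales  = c(1250.5, 980),
  Margin = c(0.21, 0.185)
)
writeData(report_wb, "Summary", toy)

# Number formats are defined on the style, then applied to cells
amount_style  &lt;- createStyle(numFmt = "#,##0.00")
percent_style &lt;- createStyle(numFmt = "0.0%")

# Data rows start at row 2 because row 1 holds the header
addStyle(report_wb, "Summary", amount_style,
         rows = 2:3, cols = 2, gridExpand = TRUE)
addStyle(report_wb, "Summary", percent_style,
         rows = 2:3, cols = 3, gridExpand = TRUE)

# Catch failures while saving instead of aborting the whole script
report_file &lt;- tempfile(fileext = ".xlsx")
saved_ok &lt;- tryCatch({
  saveWorkbook(report_wb, report_file, overwrite = TRUE)
  TRUE
}, error = function(e) {
  message("Report could not be saved: ", conditionMessage(e))
  FALSE
})</code></pre></div>
</div>
<p>Returning <code>TRUE</code> or <code>FALSE</code> from <code>tryCatch()</code> lets the calling script decide what to do next, for example retrying with a different file name.</p>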
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The <code>openxlsx</code> package is a powerful and flexible tool for generating professional Excel reports directly from R. By leveraging its capabilities, you can create sophisticated reports that include multiple sheets, tailored formatting, and integrated visualizations. This package allows for extensive customization, enabling you to apply styles, set column widths, and format numbers to meet your specific requirements.</p>
<p>As you create your reports, take advantage of features such as conditional formatting, data validation, and the ability to add hyperlinks. These functionalities can enhance the interactivity and usability of your reports, making them not only visually appealing but also more functional.</p>
<p>Don’t hesitate to experiment with various formatting options, as <code>openxlsx</code> offers a range of functions to help you manipulate the appearance of your sheets. Adapting the code to fit your reporting needs is crucial; consider how you can automate repetitive tasks or incorporate dynamic elements that reflect changes in your data.</p>
<p>Additionally, always keep performance in mind—testing your reports with datasets of varying sizes will ensure that they function smoothly and remain responsive, regardless of the data complexity. Finally, robust error handling will help you create reliable reports that can withstand unexpected data issues, thereby enhancing the user experience.</p>
<p>By following the best practices outlined in this guide, you will be well-equipped to utilize the <code>openxlsx</code> package to its fullest potential, producing high-quality, professional reports that effectively communicate your insights and findings.</p>
</section>
<section id="about-openxlsx2-package" class="level2">
<h2 class="anchored" data-anchor-id="about-openxlsx2-package">About <code>openxlsx2</code> Package</h2>
<p>While <code>openxlsx</code> is a powerful package for Excel reporting, its successor, <strong><code>openxlsx2</code></strong>, brings significant enhancements and additional features:</p>
<ol type="1">
<li><p><strong>Improved Performance</strong>:<br>
<code>openxlsx2</code> is optimized for speed and efficiency, making it faster when handling large datasets or generating complex Excel files.</p></li>
<li><p><strong>Enhanced Compatibility</strong>:<br>
The package offers better compatibility with modern Excel formats and supports advanced features such as conditional formatting and improved table styles.</p></li>
<li><p><strong>Simplified Syntax</strong>:<br>
Functions in <code>openxlsx2</code> have been refined for easier use, with clearer argument names and enhanced documentation.</p></li>
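<li><p><strong>Migration from <code>openxlsx</code></strong>:<br>
<code>openxlsx2</code> covers most of the functionality of <code>openxlsx</code>, but it is not a drop-in replacement: its functions use a <code>wb_</code> prefix (for example, <code>wb_add_worksheet()</code>), so existing code needs some adaptation before it can benefit from the new features.</p></li>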
</ol>
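<p>For a rough sense of how the syntax differs, here is a minimal sketch of creating and saving a workbook with <code>openxlsx2</code>. It assumes the package is installed; the data and the names <code>wb2</code>, <code>toy</code>, and <code>out_file</code> are hypothetical, and only a few basic functions are shown.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(openxlsx2)

# Made-up data for illustration
toy &lt;- data.frame(Month = c("Jan", "Feb"), Sales = c(100, 120))

# Each wb_* function returns the modified workbook
wb2 &lt;- wb_workbook()
wb2 &lt;- wb_add_worksheet(wb2, sheet = "Summary")
wb2 &lt;- wb_add_data(wb2, sheet = "Summary", x = toy)

# Write to a temporary file
out_file &lt;- tempfile(fileext = ".xlsx")
wb_save(wb2, out_file, overwrite = TRUE)</code></pre></div>
</div>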
<p>For users who require advanced functionality or improved performance, <code>openxlsx2</code> is an excellent alternative. You can explore the package and its documentation on <a href="https://cran.r-project.org/web/packages/openxlsx2/index.html">CRAN</a> and <a href="https://github.com/JanMarvin/openxlsx2">GitHub</a>.</p>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ol type="1">
<li><p><strong>openxlsx GitHub Repository</strong><br>
Explore the source code, issues, and development updates for the <code>openxlsx</code> package. Available at: <a href="https://github.com/ycphs/openxlsx">openxlsx GitHub Repository</a></p></li>
<li><p><strong>openxlsx Documentation</strong><br>
Access the official documentation for detailed information on functions, usage, and examples for the <code>openxlsx</code> package. Available at: <a href="https://ycphs.github.io/openxlsx/">openxlsx Documentation</a></p></li>
<li><p><strong>CRAN Package Page</strong><br>
Find installation instructions, news, and package information from the Comprehensive R Archive Network (CRAN). Available at: <a href="https://cran.r-project.org/web/packages/openxlsx/openxlsx.pdf">openxlsx CRAN Page</a></p></li>
</ol>


</section>

 ]]></description>
  <category>R Programming</category>
  <category>Report Automation</category>
  <category>openxlsx</category>
  <category>Excel</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-11-04_openxlsx/</guid>
  <pubDate>Mon, 04 Nov 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Mastering Date and Time Data in R with lubridate</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-09-30_lubridate/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://allisonhorst.com/r-packages-functions"><img src="https://mfatihtuzen.github.io/posts/2024-09-30_lubridate/lubridate.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Artwork by: Allison Horst</figcaption>
</figure>
</div>
<section id="what-is-lubridate" class="level2">
<h2 class="anchored" data-anchor-id="what-is-lubridate">What is lubridate?</h2>
<p><strong>lubridate</strong> is a powerful and widely used package in the <strong>tidyverse</strong> ecosystem, designed to make date-time manipulation in R easier and more intuitive. It was created to address the common difficulties users face when working with dates and times, which are often stored in inconsistent formats or require complex arithmetic operations.</p>
<p>Developed and maintained by the <strong>RStudio</strong> team as part of the tidyverse collection of packages, <strong>lubridate</strong> introduces a simpler syntax for parsing, extracting, and manipulating date-time data, allowing for faster and more accurate operations.</p>
<p>Key benefits of using <strong>lubridate</strong> include:</p>
<ul>
<li><p><strong>Simplified parsing</strong> of dates and times from a wide variety of formats.</p></li>
<li><p><strong>Easy extraction</strong> of components such as year, month, day, or hour from date-time objects.</p></li>
<li><p><strong>Seamless handling of time zones</strong>, allowing conversion between different zones with ease.</p></li>
<li><p><strong>Efficient arithmetic operations</strong> on dates, such as adding or subtracting days, months, or years.</p></li>
<li><p><strong>Support for durations and intervals</strong>, crucial for working with time spans in real-world applications.</p></li>
</ul>
<p>For further documentation, tutorials, and resources, you can explore the <strong>lubridate</strong> official website: <a href="https://lubridate.tidyverse.org" class="uri">https://lubridate.tidyverse.org</a>.</p>
</section>
<section id="introduction-to-date-and-time-formats" class="level2">
<h2 class="anchored" data-anchor-id="introduction-to-date-and-time-formats">Introduction to Date and Time Formats</h2>
<p>Date and time data are essential in many fields, from finance and biology to web analytics and logistics. However, handling such data can be difficult due to the variety of formats and time zones involved. In R, base functions like <code>as.Date()</code> or <code>strptime()</code> can handle date-time data, but their syntax can be cumbersome when dealing with multiple formats or time zones.</p>
<p>The <strong>lubridate</strong> package simplifies these tasks by offering intuitive functions that handle date-time data efficiently, helping us avoid many of the common pitfalls associated with date and time manipulation.</p>
</section>
<section id="why-do-we-need-lubridate" class="level2">
<h2 class="anchored" data-anchor-id="why-do-we-need-lubridate">Why Do We Need lubridate?</h2>
<p>While R provides several built-in functions for date-time manipulation, they can quickly become limited or difficult to use in more complex scenarios. The <strong>lubridate</strong> package provides solutions by:</p>
<ul>
<li><p>Offering intuitive functions to parse and format dates.</p></li>
<li><p>Supporting a variety of date-time formats in a single command.</p></li>
<li><p>Simplifying the extraction and modification of date-time components (like year, month, or hour).</p></li>
<li><p>Facilitating the handling of time zones, durations, and intervals.</p></li>
</ul>
</section>
<section id="date-and-time-formats-in-r" class="level2">
<h2 class="anchored" data-anchor-id="date-and-time-formats-in-r">Date and Time Formats in R</h2>
<p>In R, dates are typically stored as <code>Date</code> objects (which carry no time information), while date-time data is stored as <code>POSIXct</code> or <code>POSIXlt</code> objects. These classes support timestamps and can handle time zones. For example:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">date_example <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.Date</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-09-30"</span>)</span>
<span id="cb1-2">date_example</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-09-30"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">datetime_example <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.POSIXct</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-09-30 14:45:00"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tz =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UTC"</span>)</span>
<span id="cb3-2">datetime_example</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-09-30 14:45:00 UTC"</code></pre>
</div>
</div>
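<p>A short base R sketch can make the difference between these classes more concrete. The variable names <code>d</code>, <code>dt</code>, and <code>lt</code> are introduced here for illustration only.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">d  &lt;- as.Date("2024-09-30")
dt &lt;- as.POSIXct("2024-09-30 14:45:00", tz = "UTC")

class(d)        # "Date"
class(dt)       # "POSIXct" "POSIXt"

# Under the hood both are numbers counted from 1970-01-01
unclass(d)      # number of days
as.numeric(dt)  # number of seconds

# POSIXlt stores the components as a list-like structure
lt &lt;- as.POSIXlt(dt)
lt$hour         # 14
lt$year + 1900  # 2024 (years are stored as an offset from 1900)</code></pre></div>
</div>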
<p>These formats work well for simple tasks but quickly become difficult to manage in more complex scenarios. That’s where <strong>lubridate</strong> steps in.</p>
</section>
<section id="common-lubridate-functions-and-their-arguments" class="level2">
<h2 class="anchored" data-anchor-id="common-lubridate-functions-and-their-arguments">Common lubridate Functions and Their Arguments</h2>
<section id="parsing-dates-and-times" class="level3">
<h3 class="anchored" data-anchor-id="parsing-dates-and-times"><strong>Parsing Dates and Times</strong></h3>
<p>One of the core strengths of <strong>lubridate</strong> is its ability to simplify the parsing of date and time data from various formats. Functions like <code>ymd()</code>, <code>mdy()</code>, <code>dmy()</code>, and their date-time counterparts (<code>ymd_hms()</code>, <code>mdy_hms()</code>, etc.) make it easy to convert strings into R’s <code>Date</code> or <code>POSIXct</code> objects.</p>
<section id="what-do-the-letters-y-m-d-stand-for" class="level4">
<h4 class="anchored" data-anchor-id="what-do-the-letters-y-m-d-stand-for">What do the letters <code>y</code>, <code>m</code>, <code>d</code> stand for?</h4>
<p>The functions are named according to the order in which the date components appear in the input string:</p>
<ul>
<li><p><code>y</code> stands for <strong>year</strong></p></li>
<li><p><code>m</code> stands for <strong>month</strong></p></li>
<li><p><code>d</code> stands for <strong>day</strong></p></li>
<li><p><code>h</code>, <code>m</code>, <code>s</code> (used in date-time functions) stand for <strong>hours</strong>, <strong>minutes</strong>, and <strong>seconds</strong></p></li>
</ul>
<p>For example:</p>
<ul>
<li><p><strong><code>ymd()</code></strong> parses a string where the date components are in the order <strong>year-month-day</strong>.</p></li>
<li><p><strong><code>mdy()</code></strong> parses a string formatted as <strong>month-day-year</strong>.</p></li>
<li><p><strong><code>dmy()</code></strong> parses a string in <strong>day-month-year</strong> order.</p></li>
</ul>
<!-- -->
<ul>
<li>Functions: <code>ymd()</code>, <code>mdy()</code>, <code>dmy()</code>, <code>ymd_hms()</code>, <code>mdy_hms()</code>, <code>dmy_hms()</code></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(lubridate)</span>
<span id="cb5-2"></span>
<span id="cb5-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert date strings to Date objects</span></span>
<span id="cb5-4">date1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-09-30"</span>)</span>
<span id="cb5-5">date1</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-09-30"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">date2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dmy</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"30-09-2024"</span>)</span>
<span id="cb7-2">date2</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-09-30"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">date3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mdy</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"09/30/2024"</span>)</span>
<span id="cb9-2">date3</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-09-30"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert to date-time</span></span>
<span id="cb11-2">datetime1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd_hms</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-09-21 14:45:00"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tz =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UTC"</span>)</span>
<span id="cb11-3">datetime1</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-09-21 14:45:00 UTC"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1">datetime2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mdy_hms</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"09/21/2024 02:45:00 PM"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tz =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"America/New_York"</span>)</span>
<span id="cb13-2">datetime2</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-09-21 14:45:00 EDT"</code></pre>
</div>
</div>
<p>Because each function name encodes the order of the date components, you only need to pick the function that matches your input (<code>ymd()</code>, <code>mdy()</code>, or <code>dmy()</code>); separators such as <code>-</code> and <code>/</code> are handled for you, as the examples above show. This reduces errors when working with data from various sources.</p>
<p>These functions simplify the process by allowing you to focus only on the structure of the input data and not on specifying complex format strings, as would be necessary with base R functions like <code>as.Date()</code> or <code>strptime()</code>.</p>
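<p>For comparison, here is a minimal sketch of the same parse done both ways. The variable names are illustrative and not part of the original examples; the point is only that the base R call needs an explicit format string, while the lubridate call does not.</p>

```r
library(lubridate)

# Base R: the format string must describe the input exactly
base_date <- as.Date("09/30/2024", format = "%m/%d/%Y")

# lubridate: the function name carries the component order
lub_date <- mdy("09/30/2024")

# Both approaches should give the same calendar date
base_date == lub_date
```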
</section>
</section>
<section id="extracting-date-time-components" class="level3">
<h3 class="anchored" data-anchor-id="extracting-date-time-components">Extracting Date-Time Components</h3>
<p>Once you have parsed a date-time object using <strong>lubridate</strong>, you often need to extract or modify specific components, such as the year, month, day, or time. This is essential when analyzing data based on time periods, summarizing by year, or creating time-based features for models.</p>
<p><strong>Functions to Extract Date-Time Components</strong></p>
<p>Here are the most commonly used <strong>lubridate</strong> functions to extract specific parts of a date-time object:</p>
<ul>
<li><p><strong><code>year()</code></strong>: Extracts or sets the year.</p></li>
<li><p><strong><code>month()</code></strong>: Extracts or sets the month. This function can also return the month’s name if <code>label = TRUE</code> is used.</p></li>
<li><p><strong><code>day()</code></strong>: Extracts or sets the day of the month.</p></li>
<li><p><strong><code>hour()</code></strong>: Extracts or sets the hour (for time-based objects).</p></li>
<li><p><strong><code>minute()</code></strong>: Extracts or sets the minute.</p></li>
<li><p><strong><code>second()</code></strong>: Extracts or sets the second.</p></li>
<li><p><strong><code>wday()</code></strong>: Extracts the day of the week (can return the weekday’s name if <code>label = TRUE</code>).</p></li>
<li><p><strong><code>yday()</code></strong>: Extracts the day of the year (1–365 or 366 for leap years).</p></li>
<li><p><strong><code>mday()</code></strong>: Extracts the day of the month (equivalent to <code>day()</code>).</p></li>
</ul>
<p>Let’s work with a parsed date-time object and extract its components:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(lubridate)</span>
<span id="cb15-2"></span>
<span id="cb15-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Parsing a date-time object</span></span>
<span id="cb15-4">datetime <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd_hms</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-09-30 14:45:30"</span>)</span>
<span id="cb15-5"></span>
<span id="cb15-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extracting components</span></span>
<span id="cb15-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">year</span>(datetime)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2024</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">month</span>(datetime) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 9</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb19-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">day</span>(datetime) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 30</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb21-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hour</span>(datetime) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 14</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb23-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">minute</span>(datetime)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 45</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb25-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">second</span>(datetime)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 30</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb27-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extracting weekday</span></span>
<span id="cb27-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">wday</span>(datetime)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb29-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">wday</span>(datetime, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] Mon
Levels: Sun &lt; Mon &lt; Tue &lt; Wed &lt; Thu &lt; Fri &lt; Sat</code></pre>
</div>
</div>
<p>In this example, we extracted different components of the date-time object. The <code>wday()</code> function can return the day of the week either as a number (1 for Sunday, 7 for Saturday) or as a label (the weekday name) when using <code>label = TRUE</code>.</p>
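<p>If you prefer weeks that start on Monday, <code>wday()</code> also accepts a <code>week_start</code> argument. The sketch below assumes the default <code>lubridate.week.start</code> option has not been changed in your session; the variable names are illustrative.</p>

```r
library(lubridate)

monday <- ymd("2024-09-30")  # a Monday, the same date used above

# Default numbering: the week starts on Sunday, so Monday is day 2
wday_default <- wday(monday)

# Monday-first numbering: Monday becomes day 1
wday_monday_first <- wday(monday, week_start = 1)

c(wday_default, wday_monday_first)
```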
<p>In addition to extraction, <strong>lubridate</strong> allows you to modify specific components of a date or time without manually manipulating the entire string. This is particularly useful when you need to adjust dates or times in your data for analysis or alignment.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb31-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Modifying components</span></span>
<span id="cb31-2">datetime</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-09-30 14:45:30 UTC"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb33-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">year</span>(datetime) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2025</span></span>
<span id="cb33-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">month</span>(datetime) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span></span>
<span id="cb33-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hour</span>(datetime) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span></span>
<span id="cb33-4"></span>
<span id="cb33-5">datetime</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2025-12-30 08:45:30 UTC"</code></pre>
</div>
</div>
<p>In this example, the original date-time <code>2024-09-30 14:45:30</code> was modified to change the year, month, and hour, resulting in a new date-time value of <code>2025-12-30 08:45:30</code>. The day, minute, and second were left untouched.</p>
<p><strong>lubridate</strong> allows you to extract and modify months or weekdays by name as well, which is particularly useful when working with human-readable data or when creating reports:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb35-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extracting month by name</span></span>
<span id="cb35-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">month</span>(datetime, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">abbr =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] December
12 Levels: January &lt; February &lt; March &lt; April &lt; May &lt; June &lt; ... &lt; December</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb37-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Changing the month (assigned by number)</span></span>
<span id="cb37-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">month</span>(datetime) <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span></span>
<span id="cb37-3">datetime</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2025-07-30 08:45:30 UTC"</code></pre>
</div>
</div>
<p>In this example, <code>label = TRUE</code> and <code>abbr = FALSE</code> return the full name of the month (December, because the month was set to 12 earlier) instead of the numeric value or abbreviation. The second call then changes the month to July by assigning its number, 7.</p>
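<p>The month setter shown above takes a number. One way to set a month from its name is to look the name up in base R's <code>month.name</code> vector first. This is a sketch that assumes English month names; <code>dt</code> is an illustrative variable, not one from the original examples.</p>

```r
library(lubridate)

dt <- ymd_hms("2025-12-30 08:45:30")

# month.name is base R's vector of English month names;
# match() returns the position of "July" in it, which is 7
month(dt) <- match("July", month.name)

month(dt)
```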
<p>For higher-level time units such as weeks and quarters, <strong>lubridate</strong> offers convenient functions:</p>
<ul>
<li><p><strong><code>week()</code></strong>: Extracts the week of the year (1–52/53).</p></li>
<li><p><strong><code>quarter()</code></strong>: Extracts the quarter of the year (1–4).</p></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb39-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extracting the week number</span></span>
<span id="cb39-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">week</span>(datetime)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 31</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb41-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extracting the quarter</span></span>
<span id="cb41-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">quarter</span>(datetime)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 3</code></pre>
</div>
</div>
</section>
<section id="dealing-with-time-zones" class="level3">
<h3 class="anchored" data-anchor-id="dealing-with-time-zones">Dealing with Time Zones</h3>
<p>Another significant advantage of <strong>lubridate</strong> is that it handles time zones effectively when extracting date-time components. If you work with global datasets, being able to accurately account for time zones is crucial:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb43" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb43-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set a different time zone</span></span>
<span id="cb43-2">datetime</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2025-07-30 08:45:30 UTC"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb45-1">datetime_tz <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with_tz</span>(datetime, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"America/New_York"</span>)</span>
<span id="cb45-2">datetime_tz</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2025-07-30 04:45:30 EDT"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb47" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb47-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extract hour in the new time zone</span></span>
<span id="cb47-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hour</span>(datetime_tz)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 4</code></pre>
</div>
</div>
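<p>Here, <code>with_tz()</code> shows the same instant in the <code>America/New_York</code> time zone, which is on Eastern Daylight Time (EDT) on this date. The extracted hour therefore changes from 8 to 4, while the underlying moment in time stays the same.</p>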
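<p>It can help to contrast <code>with_tz()</code> with <code>force_tz()</code>, which keeps the clock reading and changes the instant instead. The following is a small sketch with illustrative variable names.</p>

```r
library(lubridate)

utc_time <- ymd_hms("2025-07-30 08:45:30", tz = "UTC")

# with_tz(): same instant, different clock reading
shown_in_ny <- with_tz(utc_time, "America/New_York")

# force_tz(): same clock reading, different instant
stamped_as_ny <- force_tz(utc_time, "America/New_York")

shown_in_ny == utc_time    # still the same moment in time
stamped_as_ny == utc_time  # a different moment in time
```

<p>Use <code>with_tz()</code> when the stored instant is right and you only want to view it elsewhere; use <code>force_tz()</code> when the clock reading is right but was labelled with the wrong zone.</p>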
</section>
<section id="creating-durations-periods-and-intervals" class="level3">
<h3 class="anchored" data-anchor-id="creating-durations-periods-and-intervals"><strong>Creating Durations, Periods, and Intervals</strong></h3>
<p>In data analysis, we often need to measure time spans, whether to calculate the difference between two dates, schedule recurring events, or model time-based phenomena. <strong>lubridate</strong> offers three powerful time-related concepts to handle these scenarios: <strong>durations</strong>, <strong>periods</strong>, and <strong>intervals</strong>. While they may seem similar, they each serve distinct purposes and behave differently depending on the use case.</p>
<section id="durations" class="level4">
<h4 class="anchored" data-anchor-id="durations"><strong>Durations</strong></h4>
<p>A <strong>duration</strong> is an exact measurement of time, expressed in seconds. Durations are useful when you need precise, unambiguous time differences regardless of calendar variations (such as leap years, varying month lengths, or daylight saving changes).</p>
<ul>
<li><strong>Duration syntax</strong>: You can create durations using the <code>dseconds()</code>, <code>dminutes()</code>, <code>dhours()</code>, <code>ddays()</code>, <code>dweeks()</code>, <code>dyears()</code> functions.</li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb49" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb49-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a duration of 1 day</span></span>
<span id="cb49-2">one_day <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ddays</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb49-3">one_day</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "86400s (~1 days)"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb51" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb51-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Duration of 2 hours and 30 minutes</span></span>
<span id="cb51-2">duration_time <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dhours</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">dminutes</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>)</span>
<span id="cb51-3">duration_time</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "9000s (~2.5 hours)"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb53" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb53-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Adding a duration to a date</span></span>
<span id="cb53-2">start_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-09-30"</span>)</span>
<span id="cb53-3">end_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> start_date <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ddays</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)</span>
<span id="cb53-4">end_date</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-10-07"</code></pre>
</div>
</div>
<p>In this example, <strong>durations</strong> are defined as fixed time lengths. Adding a duration to a date will move the date forward by the exact number of seconds, regardless of any irregularities in the calendar.</p>
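<p>Because a duration is just a number of seconds, it can be converted to a plain numeric value in whichever unit is convenient. A short sketch, with illustrative variable names:</p>

```r
library(lubridate)

span <- dhours(2) + dminutes(30)

# Length of the duration in seconds (the default unit)
span_seconds <- as.numeric(span)

# The same duration expressed in hours
span_hours <- as.numeric(span, "hours")

c(span_seconds, span_hours)
```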
</section>
<section id="periods" class="level4">
<h4 class="anchored" data-anchor-id="periods"><strong>Periods</strong></h4>
<p>Unlike durations, <strong>periods</strong> are time spans measured in human calendar terms: years, months, days, hours, etc. Periods account for calendar variations, such as leap years and daylight saving time. This makes periods more intuitive for real-world use cases, but less precise in terms of exact seconds.</p>
<ul>
<li><strong>Period syntax</strong>: Use <code>years()</code>, <code>months()</code>, <code>weeks()</code>, <code>days()</code>, <code>hours()</code>, <code>minutes()</code>, <code>seconds()</code> functions to create periods.</li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb55" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb55-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a period of 2 years, 3 months, and 10 days</span></span>
<span id="cb55-2">my_period <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">years</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">months</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">days</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb55-3">my_period </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2y 3m 10d 0H 0M 0S"</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb57" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb57-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Adding the period to a date</span></span>
<span id="cb57-2">new_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> start_date <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> my_period</span>
<span id="cb57-3">new_date</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2027-01-09"</code></pre>
</div>
</div>
<p>In this example, the <strong>period</strong> accounts for differences in calendar length (such as varying days in months). The <code>start_date</code> was <code>2024-09-30</code>, and after adding 2 years, 3 months, and 10 days, the result is <code>2027-01-09</code>.</p>
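<p>The difference between a period and a duration is easiest to see across a daylight saving change. The sketch below uses the 2024 spring transition in the <code>America/New_York</code> zone (clocks moved forward on 10 March); the variable names are illustrative.</p>

```r
library(lubridate)

# Noon on the day before clocks moved forward
before_dst <- ymd_hms("2024-03-09 12:00:00", tz = "America/New_York")

# Duration: exactly 86400 seconds later, so the clock reads one hour ahead
plus_duration <- before_dst + ddays(1)

# Period: the same clock time on the next calendar day
plus_period <- before_dst + days(1)

hour(plus_duration)
hour(plus_period)
```

<p>Adding <code>ddays(1)</code> lands at 13:00 local time, while adding <code>days(1)</code> lands at 12:00, because that calendar day was only 23 hours long.</p>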
</section>
<section id="intervals" class="level4">
<h4 class="anchored" data-anchor-id="intervals"><strong>Intervals</strong></h4>
<p>An <strong>interval</strong> represents the time span between two specific dates or times. It is useful when you want to measure or compare spans between known start and end points. Intervals take into account the exact length of time between two dates, allowing you to calculate durations or periods over that span.</p>
<ul>
<li><strong>Interval syntax</strong>: Use the <code>interval()</code> function to create an interval between two dates or date-times.</li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb59" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb59-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating an interval between two dates</span></span>
<span id="cb59-2">start_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-01-01"</span>)</span>
<span id="cb59-3">end_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-12-31"</span>)</span>
<span id="cb59-4">time_interval <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">interval</span>(start_date, end_date)</span>
<span id="cb59-5">time_interval</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2024-01-01 UTC--2024-12-31 UTC</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb61" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb61-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Checking how many days/weeks are in the interval</span></span>
<span id="cb61-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.duration</span>(time_interval)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "31536000s (~52.14 weeks)"</code></pre>
</div>
</div>
<p>In this example, an <strong>interval</strong> is created between <code>2024-01-01</code> and <code>2024-12-31</code>. Because an interval is anchored to a real start and end point, <code>as.duration()</code> can convert it to an exact length: 31,536,000 seconds, which is 365 days or about 52.14 weeks.</p>
<p>Sometimes you need to combine these time spans to perform calculations or model time-based processes. For example, you might want to take an existing interval and extend it by a period.</p>
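<p>As a minimal sketch of other ways to measure the same span, the interval can also be queried with <code>int_length()</code>, <code>time_length()</code>, division by a period, and <code>%within%</code>. The variable names below are illustrative and are not part of the original example.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

# Same start and end dates as the interval above (illustrative names)
year_span &lt;- interval(ymd("2024-01-01"), ymd("2024-12-31"))

# Exact length of the interval in seconds
span_seconds &lt;- int_length(year_span)

# Length in days, via time_length() and via division by a one-day period
span_days &lt;- time_length(year_span, unit = "day")
span_days_alt &lt;- year_span / days(1)

# Check whether a given date falls inside the interval
mid_year_inside &lt;- ymd("2024-06-15") %within% year_span

span_seconds
span_days
mid_year_inside</code></pre></div>
</div>
<p>Dividing an interval by a period such as <code>days(1)</code> or <code>weeks(1)</code> is often a convenient way to express its length in a chosen unit.</p>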
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb63" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb63-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create an interval between two dates</span></span>
<span id="cb63-2">start_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-09-01"</span>)</span>
<span id="cb63-3">end_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-12-01"</span>)</span>
<span id="cb63-4">interval_span <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">interval</span>(start_date, end_date)</span>
<span id="cb63-5">interval_span</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2024-09-01 UTC--2024-12-01 UTC</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb65" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb65-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extend the end date by 1 month</span></span>
<span id="cb65-2">new_end_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> end_date <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">months</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb65-3"></span>
<span id="cb65-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a new interval with the updated end date</span></span>
<span id="cb65-5">extended_interval <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">interval</span>(start_date, new_end_date)</span>
<span id="cb65-6"></span>
<span id="cb65-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display the extended interval</span></span>
<span id="cb65-8">extended_interval</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 2024-09-01 UTC--2025-01-01 UTC</code></pre>
</div>
</div>
<ul>
<li><p><strong>Original interval</strong>: We first create the interval <code>interval_span</code> between <code>2024-09-01</code> and <code>2024-12-01</code>.</p></li>
<li><p><strong>Adding 1 month</strong>: Instead of adding the period to the interval directly, we add <code>months(1)</code> to the end date (<code>end_date + months(1)</code>).</p></li>
<li><p><strong>New interval</strong>: We then create a new interval using the original start date and the updated end date (<code>new_end_date</code>).</p></li>
</ul>
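<p>One caveat when shifting a date by months is worth a short sketch. Adding <code>months(1)</code> works cleanly above because the end date is the first of the month. When the starting day does not exist in the target month, plain addition returns <code>NA</code>, while lubridate’s <code>%m+%</code> operator rolls back to the last valid day. The dates below are illustrative and are not taken from the example above.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

# January 31 plus one month would be "February 31", which does not exist
jan_31 &lt;- ymd("2024-01-31")

# Plain period addition yields NA for the impossible date
naive_result &lt;- jan_31 + months(1)

# %m+% rolls back to the last day of the target month instead
rolled_result &lt;- jan_31 %m+% months(1)

naive_result
rolled_result</code></pre></div>
</div>
<p>If an interval’s end date can fall late in a month, using <code>%m+%</code> when extending it avoids silently producing a missing end point.</p>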
</section>
</section>
<section id="date-arithmetic" class="level3">
<h3 class="anchored" data-anchor-id="date-arithmetic">Date Arithmetic</h3>
<p>Date arithmetic is a fundamental aspect of working with date-time data, especially in data analysis and time series forecasting. The <strong>lubridate</strong> package makes it easy to perform arithmetic operations on date-time objects, enabling users to manipulate dates effectively. This section discusses common date arithmetic operations, including adding and subtracting time spans, calculating durations, and handling periods.</p>
<p>You can perform basic arithmetic directly on date-time objects by adding or subtracting time spans expressed in units such as days, months, or hours.</p>
<p><strong>Adding Days to a Date:</strong></p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb67" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb67-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define a starting date</span></span>
<span id="cb67-2">start_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-01-01"</span>)</span>
<span id="cb67-3"></span>
<span id="cb67-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add 30 days to the starting date</span></span>
<span id="cb67-5">new_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> start_date <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">days</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>)</span>
<span id="cb67-6"></span>
<span id="cb67-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display the new date</span></span>
<span id="cb67-8">new_date</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-01-31"</code></pre>
</div>
</div>
<p>In this example:</p>
<ul>
<li><p>We define a starting date using <code>ymd()</code>.</p></li>
<li><p>We add 30 days to this date using the <code>days()</code> function.</p></li>
<li><p>The result is a new date that is 30 days later.</p></li>
</ul>
<p><strong>Subtracting Days from a Date:</strong></p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb69" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb69-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Subtract 15 days from the starting date</span></span>
<span id="cb69-2">previous_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> start_date <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">days</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)</span>
<span id="cb69-3"></span>
<span id="cb69-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display the previous date</span></span>
<span id="cb69-5">previous_date</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2023-12-17"</code></pre>
</div>
</div>
<p>Here, we subtract days from a date. The same operation works with other time units, such as months, years, or hours, through the corresponding functions (<code>months()</code>, <code>years()</code>, <code>hours()</code>).</p>
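<p>A minimal sketch of the same arithmetic with other units follows. The variable names and the meeting time are made up for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

start_date &lt;- ymd("2024-01-01")
meeting_start &lt;- ymd_hms("2024-01-01 09:00:00")  # hypothetical meeting time

# Shift a date by months, years, and weeks
one_month_earlier &lt;- start_date - months(1)
one_year_later &lt;- start_date + years(1)
two_weeks_later &lt;- start_date + weeks(2)

# Periods can be combined when working with date-times
meeting_end &lt;- meeting_start + hours(1) + minutes(30)

one_month_earlier
one_year_later
two_weeks_later
meeting_end</code></pre></div>
</div>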
<p>Date arithmetic is commonly used in various practical applications, such as:</p>
<ul>
<li><p><strong>Time Series Analysis</strong>: Analyzing trends over specific periods (e.g., monthly sales growth).</p></li>
<li><p><strong>Event Planning</strong>: Calculating the duration between events (e.g., project deadlines).</p></li>
<li><p><strong>Scheduling</strong>: Determining time slots for meetings or tasks based on calendar events.</p></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb71" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb71-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define task durations</span></span>
<span id="cb71-2">task_duration <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">hours</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Each task takes 3 hours</span></span>
<span id="cb71-3">start_time <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd_hms</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-01-01 09:00:00"</span>)</span>
<span id="cb71-4"></span>
<span id="cb71-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Schedule three tasks</span></span>
<span id="cb71-6">schedule <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> start_time <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> task_duration <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb71-7"></span>
<span id="cb71-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display the schedule for tasks</span></span>
<span id="cb71-9">schedule</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-01-01 09:00:00 UTC" "2024-01-01 12:00:00 UTC"
[3] "2024-01-01 15:00:00 UTC"</code></pre>
</div>
</div>
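<p>In this example, we define a 3-hour task length with <code>hours(3)</code> and multiply it by <code>0:2</code>, which yields offsets of 0, 3, and 6 hours. Adding these offsets to the start time gives the start of each of the three consecutive tasks.</p>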
</section>
</section>
<section id="using-lubridate-with-time-series-data-in-r" class="level2">
<h2 class="anchored" data-anchor-id="using-lubridate-with-time-series-data-in-r">Using lubridate with Time Series Data in R</h2>
<p>In time series analysis, properly handling date and time variables is crucial for ensuring accurate results. <strong>lubridate</strong> simplifies working with dates and times, but it’s also important to know how to integrate it with base R’s time series objects like <code>ts</code> and more flexible formats like date-time data frames.</p>
<section id="creating-time-series-with-ts-in-r" class="level3">
<h3 class="anchored" data-anchor-id="creating-time-series-with-ts-in-r"><strong>Creating Time Series with <code>ts()</code> in R</strong></h3>
<p>Base R’s <code>ts()</code> function is typically used to create regular time series objects. A <code>ts</code> object needs a defined frequency, meaning the number of observations per unit of time (e.g., 12 for monthly or 4 for quarterly data), and a starting point.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb73" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb73-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Sample data: monthly sales from 2020 to 2021</span></span>
<span id="cb73-2">sales_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">120</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">170</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">160</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">130</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">140</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">180</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">190</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">210</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">220</span>,</span>
<span id="cb73-3">                <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">230</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">250</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">270</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">280</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">260</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">290</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">310</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">330</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">340</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">360</span>)</span>
<span id="cb73-4"></span>
<span id="cb73-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a time series object (monthly data starting from Jan 2020)</span></span>
<span id="cb73-6">ts_sales <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ts</span>(sales_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">start =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2020</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">frequency =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb73-7">ts_sales</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2020 100 120 150 170 160 130 140 180 200 190 210 220
2021 230 250 270 300 280 260 290 310 330 340 350 360</code></pre>
</div>
</div>
<p>This code creates a time series object representing monthly sales from January 2020 to December 2021.</p>
<ul>
<li><p><code>start = c(2020, 1)</code> indicates the time series starts in January 2020.</p></li>
<li><p><code>frequency = 12</code> specifies that the data is monthly (12 periods per year).</p></li>
</ul>
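<p>To see how <code>start</code> and <code>frequency</code> shape a <code>ts</code> object, the short sketch below inspects one and extracts a single year with <code>window()</code>. It uses made-up values rather than the sales figures above, and the object names are illustrative.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Illustrative monthly series: 24 made-up values starting in January 2020
monthly_ts &lt;- ts(seq(100, by = 10, length.out = 24),
                 start = c(2020, 1), frequency = 12)

# Inspect the time attributes stored in the object
frequency(monthly_ts)   # observations per year
start(monthly_ts)       # first period as c(year, month)
end(monthly_ts)         # last period as c(year, month)

# Extract only the observations from 2021
ts_2021 &lt;- window(monthly_ts, start = c(2021, 1), end = c(2021, 12))
ts_2021

# The same idea with quarterly data: frequency = 4
quarterly_ts &lt;- ts(1:8, start = c(2020, 1), frequency = 4)
end(quarterly_ts)</code></pre></div>
</div>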
</section>
<section id="converting-a-ts-object-to-a-data-frame-with-a-date-variable" class="level3">
<h3 class="anchored" data-anchor-id="converting-a-ts-object-to-a-data-frame-with-a-date-variable"><strong>Converting a <code>ts</code> Object to a Data Frame with a Date Variable</strong></h3>
<p>When working with time series data, we often need to convert a <code>ts</code> object into a data frame to analyze it along with specific dates. <strong>lubridate</strong> can be used to handle date conversions easily.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb75" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb75-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert time series to a data frame with date information</span></span>
<span id="cb75-2">sales_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb75-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2020-01-01"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"month"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(ts_sales)),</span>
<span id="cb75-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sales =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(ts_sales)</span>
<span id="cb75-5">)</span>
<span id="cb75-6"></span>
<span id="cb75-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display the resulting data frame</span></span>
<span id="cb75-8">sales_df</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>         date sales
1  2020-01-01   100
2  2020-02-01   120
3  2020-03-01   150
4  2020-04-01   170
5  2020-05-01   160
6  2020-06-01   130
7  2020-07-01   140
8  2020-08-01   180
9  2020-09-01   200
10 2020-10-01   190
11 2020-11-01   210
12 2020-12-01   220
13 2021-01-01   230
14 2021-02-01   250
15 2021-03-01   270
16 2021-04-01   300
17 2021-05-01   280
18 2021-06-01   260
19 2021-07-01   290
20 2021-08-01   310
21 2021-09-01   330
22 2021-10-01   340
23 2021-11-01   350
24 2021-12-01   360</code></pre>
</div>
</div>
<p>In this example, we:</p>
<ul>
<li><p>Convert the <code>ts</code> object to a numeric vector (<code>as.numeric(ts_sales)</code>).</p></li>
<li><p>Use <code>seq()</code> and <strong>lubridate’s</strong> <code>ymd()</code> function to create a sequence of dates starting from <code>"2020-01-01"</code>, incrementing monthly (<code>by = "month"</code>).</p></li>
<li><p>Combine both into a data frame with a <code>date</code> column containing actual dates and a <code>sales</code> column with the sales data.</p></li>
</ul>
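<p>The conversion above hard-codes the first date. As a hedged alternative, the sketch below reads the starting year and month from the <code>ts</code> object itself with <code>start()</code> and builds the dates with <code>make_date()</code> and <code>%m+%</code>. It assumes a monthly series (<code>frequency = 12</code>) and uses made-up values and illustrative names.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

# Illustrative monthly series (made-up values)
ts_example &lt;- ts(seq(100, by = 10, length.out = 24),
                 start = c(2020, 1), frequency = 12)

# For monthly data, start() returns c(year, month)
first_period &lt;- start(ts_example)
first_date &lt;- make_date(first_period[1], first_period[2], 1)

# Offset the first date by 0, 1, 2, ... months
example_df &lt;- data.frame(
  date = first_date %m+% months(0:(length(ts_example) - 1)),
  value = as.numeric(ts_example)
)

head(example_df, 3)</code></pre></div>
</div>
<p>Reading the start from the object keeps the date column consistent if the series is later rebuilt with a different starting period.</p>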
</section>
<section id="creating-time-series-from-date-time-data" class="level3">
<h3 class="anchored" data-anchor-id="creating-time-series-from-date-time-data"><strong>Creating Time Series from Date-Time Data</strong></h3>
<p>Time series data can also be created directly from date-time information, such as daily, hourly, or minute-based data. <strong>lubridate</strong> can be used to efficiently generate or manipulate such time series.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb77" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb77-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate a sequence of daily dates</span></span>
<span id="cb77-2">daily_dates <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-01"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"day"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>)</span>
<span id="cb77-3"></span>
<span id="cb77-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a sample dataset with random values for each day</span></span>
<span id="cb77-5">daily_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb77-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">date =</span> daily_dates,</span>
<span id="cb77-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">min =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">max =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb77-8">)</span>
<span id="cb77-9"></span>
<span id="cb77-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View the first few rows of the dataset</span></span>
<span id="cb77-11"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(daily_data)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>        date    value
1 2023-01-01 114.0204
2 2023-01-02 158.4874
3 2023-01-03 102.0644
4 2023-01-04 119.3779
5 2023-01-05 167.0183
6 2023-01-06 185.6002</code></pre>
</div>
</div>
<p>In this example, we create a time series dataset for daily data:</p>
<ul>
<li><p><strong><code>ymd()</code></strong> parses the start date <code>"2023-01-01"</code>, and <code>seq()</code> with <code>by = "day"</code> generates the 30 consecutive daily dates.</p></li>
<li><p><strong><code>runif()</code></strong> generates random values to simulate daily observations. Because no seed is set here, the values will differ on each run.</p></li>
</ul>
<p>You can use this type of time series in various analysis techniques, including plotting trends over time or aggregating data by week, month, or year.</p>
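<p>As one small illustration of such an analysis, the sketch below flags weekends with <code>wday()</code> and compares average values on weekdays and weekends. The seed, the column names, and the weekend rule are assumptions made for this example.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

set.seed(42)  # assumed seed so the simulated values are reproducible
daily_example &lt;- data.frame(
  date = seq(ymd("2023-01-01"), by = "day", length.out = 30),
  value = runif(30, min = 100, max = 200)
)

# Day of the week as a number, with Monday = 1 and Sunday = 7
daily_example$weekday &lt;- wday(daily_example$date, week_start = 1)

# Treat Saturday (6) and Sunday (7) as the weekend
daily_example$is_weekend &lt;- daily_example$weekday &gt;= 6

# Average value on weekdays versus weekends
aggregate(value ~ is_weekend, data = daily_example, FUN = mean)</code></pre></div>
</div>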
</section>
<section id="working-with-time-series-intervals" class="level3">
<h3 class="anchored" data-anchor-id="working-with-time-series-intervals"><strong>Working with Time Series Intervals</strong></h3>
<p>Sometimes, you need to manipulate time series data by grouping or splitting it into different intervals. <strong>lubridate</strong> makes this task easier by providing intuitive functions to work with intervals, durations, and periods.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb79" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb79-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr)</span>
<span id="cb79-2"></span>
<span id="cb79-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Sample dataset: daily values over one month</span></span>
<span id="cb79-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb79-5">time_series_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb79-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-01"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"day"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>),</span>
<span id="cb79-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">min =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">max =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">150</span>)</span>
<span id="cb79-8">)</span>
<span id="cb79-9"></span>
<span id="cb79-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Aggregating the data by week</span></span>
<span id="cb79-11">weekly_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> time_series_data <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb79-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">week =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">floor_date</span>(date, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"week"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb79-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(week) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb79-14">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">weekly_avg =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(value))</span>
<span id="cb79-15"></span>
<span id="cb79-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View the aggregated data</span></span>
<span id="cb79-17">weekly_data</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 5 × 2
  week       weekly_avg
  &lt;date&gt;          &lt;dbl&gt;
1 2023-01-01      105. 
2 2023-01-08      115. 
3 2023-01-15       99.5
4 2023-01-22      119. 
5 2023-01-29       71.8</code></pre>
</div>
</div>
<p>Here, we use <strong>lubridate’s</strong> <code>floor_date()</code> function to round each date down to the start of its week (Sunday by default). The data is then grouped by week and summarized to compute the weekly average. Note that the last group covers only two days (January 29 and 30), so its average rests on fewer observations. This approach can easily be adapted to other time periods, such as months or quarters, using <code>floor_date(date, "month")</code> or <code>floor_date(date, "quarter")</code>.</p>
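<p>To make the choice of unit and week start concrete, the sketch below applies <code>floor_date()</code> to a single date. The date and variable names are illustrative.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

sample_date &lt;- ymd("2023-05-17")  # a Wednesday

# Default behavior: weeks start on Sunday
week_sunday &lt;- floor_date(sample_date, "week")

# week_start = 1 makes weeks start on Monday instead
week_monday &lt;- floor_date(sample_date, "week", week_start = 1)

# Coarser units for monthly or quarterly grouping
month_start &lt;- floor_date(sample_date, "month")
quarter_start &lt;- floor_date(sample_date, "quarter")

week_sunday
week_monday
month_start
quarter_start</code></pre></div>
</div>
<p>Passing <code>week_start = 1</code> inside the <code>mutate()</code> call is one way to align weekly groups with a Monday-based calendar.</p>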
</section>
<section id="handling-irregular-time-series" class="level3">
<h3 class="anchored" data-anchor-id="handling-irregular-time-series"><strong>Handling Irregular Time Series</strong></h3>
<p>Not all time series data comes in regular intervals (e.g., daily, weekly). For irregular time series, <strong>lubridate</strong> can be used to efficiently handle missing or irregular dates.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb81" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb81-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example of irregular dates (missing some days)</span></span>
<span id="cb81-2">irregular_dates <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-01"</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-02"</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-05"</span>),</span>
<span id="cb81-3">                     <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-07"</span>), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2023-01-10"</span>))</span>
<span id="cb81-4"></span>
<span id="cb81-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a dataset with missing dates</span></span>
<span id="cb81-6">irregular_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb81-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">date =</span> irregular_dates,</span>
<span id="cb81-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">min =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">max =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>)</span>
<span id="cb81-9">)</span>
<span id="cb81-10"></span>
<span id="cb81-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Complete the time series by filling missing dates</span></span>
<span id="cb81-12">complete_dates <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb81-13">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(irregular_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>date), <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">max</span>(irregular_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>date), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"day"</span>)</span>
<span id="cb81-14">)</span>
<span id="cb81-15"></span>
<span id="cb81-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Join the original data with the complete sequence of dates</span></span>
<span id="cb81-17">complete_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">merge</span>(complete_dates, irregular_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"date"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">all.x =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb81-18"></span>
<span id="cb81-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View the completed data with missing values</span></span>
<span id="cb81-20">complete_data</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>         date    value
1  2023-01-01 196.3024
2  2023-01-02 190.2299
3  2023-01-03       NA
4  2023-01-04       NA
5  2023-01-05 169.0705
6  2023-01-06       NA
7  2023-01-07 179.5467
8  2023-01-08       NA
9  2023-01-09       NA
10 2023-01-10 102.4614</code></pre>
</div>
</div>
<p>In this example:</p>
<ul>
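<li><p><strong>lubridate</strong>’s <code>ymd()</code> parses the date strings into <code>Date</code> objects, so the gaps between observations can be computed reliably.</p></li>
<li><p>We fill the missing dates by generating a complete daily sequence with <code>seq()</code> and left-joining the original data onto it with <code>merge(..., all.x = TRUE)</code>.</p></li>
<li><p>Dates that were absent from the original data receive <code>NA</code> in the <code>value</code> column. Because the values come from <code>runif()</code> without a fixed seed, the exact numbers will differ between runs.</p></li>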
</ul>
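<p>The same gap-filling step can also be sketched with <code>tidyr::complete()</code> instead of <code>merge()</code>. This is an alternative to the approach above, not the method used in the original example; the seed value and the name <code>completed</code> are assumptions added for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tidyr)
library(lubridate)

set.seed(123)  # assumed seed, used only to make the random values reproducible

irregular_data &lt;- data.frame(
  date  = ymd(c("2023-01-01", "2023-01-02", "2023-01-05",
                "2023-01-07", "2023-01-10")),
  value = runif(5, min = 100, max = 200)
)

# Expand to every day between the first and last date; new rows get NA
completed &lt;- complete(
  irregular_data,
  date = seq(min(date), max(date), by = "day")
)

# Number of days that were absent from the original data
sum(is.na(completed$value))</code></pre></div>
</div>
<p>Keeping the inserted rows as <code>NA</code> makes the gaps visible; whether to impute them afterwards depends on the analysis.</p>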
</section>
<section id="using-time-series-formats-with-lubridate-functions" class="level3">
<h3 class="anchored" data-anchor-id="using-time-series-formats-with-lubridate-functions"><strong>Using Time Series Formats with <code>lubridate</code> Functions</strong></h3>
<p>You can combine <strong>lubridate</strong> functions with base R’s <code>ts</code> objects for more flexible time series analysis. A <code>ts</code> object stores only a start time and a frequency rather than actual dates, so the usual approach is to rebuild the date sequence in a data frame and then extract components such as year, month, or week with <strong>lubridate</strong>.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb83" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb83-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Converting a ts object to a data frame with dates</span></span>
<span id="cb83-2">ts_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ts</span>(sales_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">start =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2020</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">frequency =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>)</span>
<span id="cb83-3"></span>
<span id="cb83-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a data frame from the ts object</span></span>
<span id="cb83-5">df_ts <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb83-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ymd</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2020-01-01"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"month"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">length.out =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">length</span>(ts_data)),</span>
<span id="cb83-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sales =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(ts_data)</span>
<span id="cb83-8">)</span>
<span id="cb83-9"></span>
<span id="cb83-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extract year and month using lubridate</span></span>
<span id="cb83-11">df_ts <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_ts <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb83-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">year =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">year</span>(date), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">month =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">month</span>(date))</span>
<span id="cb83-13"></span>
<span id="cb83-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># View the data with extracted components</span></span>
<span id="cb83-15">df_ts</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>         date sales year month
1  2020-01-01   100 2020     1
2  2020-02-01   120 2020     2
3  2020-03-01   150 2020     3
4  2020-04-01   170 2020     4
5  2020-05-01   160 2020     5
6  2020-06-01   130 2020     6
7  2020-07-01   140 2020     7
8  2020-08-01   180 2020     8
9  2020-09-01   200 2020     9
10 2020-10-01   190 2020    10
11 2020-11-01   210 2020    11
12 2020-12-01   220 2020    12
13 2021-01-01   230 2021     1
14 2021-02-01   250 2021     2
15 2021-03-01   270 2021     3
16 2021-04-01   300 2021     4
17 2021-05-01   280 2021     5
18 2021-06-01   260 2021     6
19 2021-07-01   290 2021     7
20 2021-08-01   310 2021     8
21 2021-09-01   330 2021     9
22 2021-10-01   340 2021    10
23 2021-11-01   350 2021    11
24 2021-12-01   360 2021    12</code></pre>
</div>
</div>
<p>Here, we convert the <code>ts</code> object into a data frame and use <strong>lubridate</strong>’s <code>year()</code> and <code>month()</code> functions to extract date components, which can be used for further analysis (e.g., grouping by month or year).</p>
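<p>To illustrate the grouping step mentioned above, here is a short sketch that totals sales per year. The <code>sales</code> vector is copied from the printed output above, and the names <code>sales</code> and <code>yearly_sales</code> are introduced only for this example.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(dplyr)
library(lubridate)

# Monthly sales values taken from the output shown above
sales &lt;- c(100, 120, 150, 170, 160, 130, 140, 180, 200, 190, 210, 220,
           230, 250, 270, 300, 280, 260, 290, 310, 330, 340, 350, 360)

df_ts &lt;- data.frame(
  date  = seq(ymd("2020-01-01"), by = "month", length.out = length(sales)),
  sales = sales
)

# Extract the year with lubridate and sum sales within each year
yearly_sales &lt;- df_ts %&gt;%
  mutate(year = year(date)) %&gt;%
  group_by(year) %&gt;%
  summarise(total_sales = sum(sales))

yearly_sales</code></pre></div>
</div>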
</section>
</section>
<section id="solving-real-world-date-time-issues" class="level2">
<h2 class="anchored" data-anchor-id="solving-real-world-date-time-issues">Solving Real-World Date-Time Issues</h2>
<p>Handling date-time data in real-world applications often involves dealing with a variety of formats and potential inconsistencies. The <strong>lubridate</strong> package provides powerful functions to parse, manipulate, and format date-time data efficiently. This section focuses on how to use these functions, especially <code>parse_date_time()</code>, to address common date-time challenges.</p>
<p>When working with datasets, date-time values may not always be in a standard format. For instance, you might encounter dates represented as strings in various formats like <code>"YYYY-MM-DD"</code>, <code>"MM/DD/YYYY"</code>, or even <code>"Month DD, YYYY"</code>. To perform analysis accurately, it’s crucial to convert these strings into proper date-time objects.</p>
<p>The <code>parse_date_time()</code> function is one of the most versatile functions in the <strong>lubridate</strong> package. It allows you to specify multiple possible formats for parsing a date-time string. This flexibility is especially useful when dealing with datasets from different sources or with inconsistent date formats.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb85" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb85-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">parse_date_time</span>(x, orders, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tz =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"UTC"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">quiet =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>x</code></strong>: A character vector of date-time strings to be parsed.</p></li>
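<li><p><strong><code>orders</code></strong>: A character vector of candidate orders, i.e. the sequence of date-time components expected in the strings (e.g., <code>"ymd"</code>, <code>"mdy"</code>). Several orders can be supplied when the input mixes formats.</p></li>
<li><p><strong><code>tz</code></strong>: The time zone in which the strings are interpreted (default is <code>"UTC"</code>).</p></li>
<li><p><strong><code>quiet</code></strong>: If <code>TRUE</code>, suppresses the messages and warnings normally shown during parsing.</p></li>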
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb86" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb86-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Example date-time strings in various formats</span></span>
<span id="cb86-2">dates <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2024-01-15"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"01/16/2024"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"March 17, 2024"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"18-04-2024"</span>)</span>
<span id="cb86-3"></span>
<span id="cb86-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Parse the dates using parse_date_time</span></span>
<span id="cb86-5">parsed_dates <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">parse_date_time</span>(dates, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">orders =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ymd"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mdy"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dmy"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B d, Y"</span>))</span>
<span id="cb86-6"></span>
<span id="cb86-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display the parsed dates</span></span>
<span id="cb86-8">parsed_dates</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "2024-01-15 UTC" "2024-01-16 UTC" "2024-03-17 UTC" "2024-04-18 UTC"</code></pre>
</div>
</div>
<p>In this example:</p>
<ul>
<li><p>The <code>dates</code> vector contains strings in various formats.</p></li>
<li><p>The <code>parse_date_time()</code> function attempts to parse each date according to the specified orders.</p></li>
<li><p>The output is a vector of <code>POSIXct</code> date-time objects in UTC, regardless of how each input string was written.</p></li>
</ul>
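<p>Real data often contains entries that match none of the supplied orders. The sketch below shows one way to detect them; the input vector <code>mixed</code> is hypothetical and the handling shown is only one reasonable option.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

# Hypothetical input: two valid dates and one placeholder string
mixed &lt;- c("2024-01-15", "01/16/2024", "n/a")

# quiet = TRUE suppresses the warning about entries that failed to parse
parsed &lt;- parse_date_time(mixed, orders = c("ymd", "mdy"), quiet = TRUE)

# Entries that could not be parsed come back as NA
mixed[is.na(parsed)]</code></pre></div>
</div>
<p>Inspecting the unparsed strings before dropping them helps reveal formats that still need to be added to <code>orders</code>.</p>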
</section>
<section id="alternative-packages-and-comparison-with-lubridate" class="level2">
<h2 class="anchored" data-anchor-id="alternative-packages-and-comparison-with-lubridate">Alternative Packages and Comparison with <code>lubridate</code></h2>
<p>Several R packages can handle date-time data, each with its strengths and weaknesses. Below, we discuss these packages, comparing their functionalities with those of the <strong>lubridate</strong> package.</p>
<section id="base-r-functions" class="level3">
<h3 class="anchored" data-anchor-id="base-r-functions"><strong>Base R Functions</strong></h3>
<p><strong>Similarities:</strong></p>
<ul>
<li>Both <strong>lubridate</strong> and base R can convert character strings to date or date-time objects. In base R this is done with <code>as.Date()</code> and <code>as.POSIXct()</code>, and <strong>lubridate</strong> returns the same <code>Date</code> and <code>POSIXct</code> classes.</li>
</ul>
<p><strong>Differences:</strong></p>
<ul>
<li>Base R functions require more manual handling of date-time formats, whereas <strong>lubridate</strong> offers a more user-friendly and intuitive syntax for parsing and manipulating dates.</li>
</ul>
<p><strong>Advantages of Base R:</strong></p>
<ul>
<li><p>No additional package installation is required, making it lightweight.</p></li>
<li><p>Suitable for basic date-time manipulations.</p></li>
</ul>
<p><strong>Disadvantages of Base R:</strong></p>
<ul>
<li><p>Limited functionality for complex date-time operations.</p></li>
<li><p>Syntax can be less intuitive, especially for beginners.</p></li>
</ul>
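<p>The difference in syntax described above can be sketched with a single date string. The variable names are chosen for illustration only.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

# Base R: the format string must be spelled out explicitly
base_date &lt;- as.Date("01/16/2024", format = "%m/%d/%Y")

# lubridate: the function name encodes the order of the components
lub_date &lt;- mdy("01/16/2024")

# Both approaches produce the same Date value
base_date == lub_date</code></pre></div>
</div>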
</section>
<section id="chron-package" class="level3">
<h3 class="anchored" data-anchor-id="chron-package"><strong><code>chron</code> Package</strong></h3>
<p><strong>Similarities:</strong></p>
<ul>
<li>Both <strong>chron</strong> and <strong>lubridate</strong> provide functionalities for working with dates and times, making it easy to manage these data types.</li>
</ul>
<p><strong>Differences:</strong></p>
<ul>
<li><strong>chron</strong> is focused more on simpler date-time representations and does not handle time zones as effectively as <strong>lubridate</strong>.</li>
</ul>
<p><strong>Advantages of <code>chron</code>:</strong></p>
<ul>
<li><p>Straightforward for handling date-time data without complexity.</p></li>
<li><p>Lightweight and easy to use for simple applications.</p></li>
</ul>
<p><strong>Disadvantages of <code>chron</code>:</strong></p>
<ul>
<li><p>Lacks advanced features for manipulating dates and times.</p></li>
<li><p>Limited support for time zones and complex date-time arithmetic.</p></li>
</ul>
</section>
<section id="data.table-package" class="level3">
<h3 class="anchored" data-anchor-id="data.table-package"><strong><code>data.table</code> Package</strong></h3>
<p><strong>Similarities:</strong></p>
<ul>
<li>Both packages allow for efficient date-time operations, and <strong>data.table</strong> provides functions to convert to date objects (e.g., <code>as.IDate()</code>).</li>
</ul>
<p><strong>Differences:</strong></p>
<ul>
<li><strong>data.table</strong> is primarily a data manipulation package optimized for speed and performance, whereas <strong>lubridate</strong> focuses specifically on date-time operations.</li>
</ul>
<p><strong>Advantages of <code>data.table</code>:</strong></p>
<ul>
<li><p>Excellent performance with large datasets.</p></li>
<li><p>Integrates well with data manipulation tasks, including date-time operations.</p></li>
</ul>
<p><strong>Disadvantages of <code>data.table</code>:</strong></p>
<ul>
<li><p>More complex syntax, especially for users unfamiliar with data.table conventions.</p></li>
<li><p>Primarily focused on data manipulation rather than dedicated date-time handling.</p></li>
</ul>
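<p>For comparison, here is a minimal sketch of a date-based aggregation written entirely in <strong>data.table</strong>. The table <code>dt</code> and its values are made up for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(data.table)

# Hypothetical data for illustration
dt &lt;- data.table(
  date  = as.IDate(c("2024-01-15", "2024-01-16", "2024-02-01")),
  value = c(10, 20, 30)
)

# Total per calendar month, using data.table's own month() helper
monthly_totals &lt;- dt[, .(total = sum(value)), by = .(month = month(date))]

monthly_totals</code></pre></div>
</div>
<p>Note that <strong>data.table</strong> and <strong>lubridate</strong> both export helpers such as <code>year()</code> and <code>month()</code>, so when both packages are attached it is safer to call them with an explicit prefix, for example <code>lubridate::month()</code>.</p>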
</section>
<section id="zoo-and-xts-packages" class="level3">
<h3 class="anchored" data-anchor-id="zoo-and-xts-packages"><strong><code>zoo</code> and <code>xts</code> Packages</strong></h3>
<p><strong>Similarities:</strong></p>
<ul>
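<li>Like <strong>lubridate</strong>, <strong>zoo</strong> and <strong>xts</strong> work with date-time indexed data, and their time indexes can be built from the <code>Date</code> and <code>POSIXct</code> objects that <strong>lubridate</strong> produces.</li>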
</ul>
<p><strong>Differences:</strong></p>
<ul>
<li><strong>lubridate</strong> excels in date-time parsing and manipulation, while <strong>zoo</strong> and <strong>xts</strong> focus more on creating and manipulating time series objects.</li>
</ul>
<p><strong>Advantages of <code>zoo</code> and <code>xts</code>:</strong></p>
<ul>
<li><p>Specialized for handling irregularly spaced time series.</p></li>
<li><p>Provides robust tools for time series analysis, including indexing and subsetting.</p></li>
</ul>
<p><strong>Disadvantages of <code>zoo</code> and <code>xts</code>:</strong></p>
<ul>
<li><p>Not as intuitive for general date-time manipulation tasks.</p></li>
<li><p>Requires additional knowledge of time series concepts.</p></li>
</ul>
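<p>The two approaches can be combined: <strong>lubridate</strong> parses the dates and <strong>zoo</strong> stores the irregular series. The following is a small sketch with made-up values; the object names are assumptions for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(zoo)
library(lubridate)

# Irregularly spaced dates parsed with lubridate
obs_dates &lt;- ymd(c("2023-01-01", "2023-01-02", "2023-01-05", "2023-01-07"))

# A zoo series indexed by those dates (values are made up)
z &lt;- zoo(c(10, 12, 15, 11), order.by = obs_dates)

# Gaps between consecutive observations, in days
gaps &lt;- diff(index(z))
gaps

# Restrict the series to observations on or after 5 January
recent &lt;- window(z, start = ymd("2023-01-05"))
recent</code></pre></div>
</div>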
</section>
<section id="advantages-of-lubridate" class="level3">
<h3 class="anchored" data-anchor-id="advantages-of-lubridate">Advantages of <code>lubridate</code></h3>
<ol type="1">
<li><p><strong>User-Friendly Syntax</strong>: <strong>lubridate</strong> offers intuitive functions for parsing, manipulating, and formatting date-time objects, making it accessible to users of all skill levels.</p></li>
<li><p><strong>Flexible Parsing</strong>: Functions such as <code>parse_date_time()</code> accept several candidate orders at once, so mixed date-time formats can be parsed without writing a separate format string for each one.</p></li>
<li><p><strong>Comprehensive Functionality</strong>: Provides a wide range of functions for date-time arithmetic, extracting components, and working with durations, periods, and intervals.</p></li>
<li><p><strong>Time Zone Handling</strong>: Strong support for working with time zones, making it easy to convert between different zones.</p></li>
</ol>
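<p>The points about date-time arithmetic and time zones can be illustrated with a short sketch around a daylight saving change. The date and the <code>"America/New_York"</code> zone are chosen for illustration only, and the example assumes the system time zone database is available.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(lubridate)

# Noon on the day before US clocks move forward in 2024
start &lt;- ymd_hms("2024-03-09 12:00:00", tz = "America/New_York")

# Period: one calendar day later, same clock time
next_day_period &lt;- start + days(1)

# Duration: exactly 86400 seconds later, so the clock time shifts by an hour
next_day_duration &lt;- start + ddays(1)

# The same instant expressed in another time zone
start_utc &lt;- with_tz(start, "UTC")

hour(next_day_period)
hour(next_day_duration)
hour(start_utc)</code></pre></div>
</div>
<p>Periods follow the calendar while durations follow elapsed seconds, which is why the two results differ by one hour across the clock change.</p>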
</section>
<section id="disadvantages-of-lubridate" class="level3">
<h3 class="anchored" data-anchor-id="disadvantages-of-lubridate">Disadvantages of <code>lubridate</code></h3>
<ol type="1">
<li><p><strong>Performance</strong>: For very large datasets, <strong>lubridate</strong>’s flexible parsers can be slower than more specialized tools such as <strong>data.table</strong> or <strong>xts</strong>, particularly when many candidate formats have to be tried.</p></li>
<li><p><strong>Learning Curve</strong>: Although user-friendly, beginners may still face a learning curve when transitioning from basic date-time manipulation in base R to more advanced functionalities in <strong>lubridate</strong>.</p></li>
<li><p><strong>Dependency</strong>: Requires installation of an additional package, which may not be ideal for all projects or environments.</p></li>
</ol>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>The <code>lubridate</code> package is a powerful tool for handling date and time data in R, offering user-friendly functions for parsing, manipulating, and formatting date-time objects. Key features include:</p>
<ul>
<li><p><strong>Flexible Parsing</strong>: Functions like <code>ymd()</code>, <code>mdy()</code>, and <code>parse_date_time()</code> make it easy to convert various formats into date-time objects.</p></li>
<li><p><strong>Component Extraction</strong>: Extracting components such as year, month, and day with functions like <code>year()</code> and <code>month()</code> simplifies detailed analysis.</p></li>
<li><p><strong>Time Measurements</strong>: Creating durations, periods, and intervals allows for nuanced time calculations, enhancing temporal analysis.</p></li>
</ul>
<p>While <code>lubridate</code> excels in usability and flexibility, it’s important to consider its performance limitations with large datasets and the potential learning curve for new users. Comparing it with alternatives like base R, <code>chron</code>, <code>data.table</code>, <code>zoo</code>, and <code>xts</code> reveals that each package has its strengths, but <code>lubridate</code> stands out for its comprehensive approach to date-time manipulation.</p>
<p>Incorporating <code>lubridate</code> into your R workflow can streamline date-time processing and reduce the amount of manual format handling in your analysis.</p>
<p>For more information, refer to the <a href="https://lubridate.tidyverse.org/">official lubridate documentation</a>.</p>


</section>
</section>

 ]]></description>
  <category>R Programming</category>
  <category>lubridate</category>
  <category>time series</category>
  <category>time manipulation</category>
  <category>date handling</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-09-30_lubridate/</guid>
  <pubDate>Mon, 30 Sep 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Mastering Data Transformation in R with pivot_longer and pivot_wider</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-09-19_pivot/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.pipinghotdata.com/posts/2021-08-27-a-tidyverse-pivot-approach-to-data-preparation-in-r/"><img src="https://mfatihtuzen.github.io/posts/2024-09-19_pivot/pivot.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>Artwork by: Shannon Pileggi and Allison Horst</figcaption>
</figure>
</div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Data analysis requires a deep understanding of how to structure data effectively. Often, datasets are not in the format most suitable for analysis or visualization. That’s where data transformation comes in. Converting data between wide (horizontal) and long (vertical) formats is an essential skill for any data analyst or scientist, ensuring that data is correctly organized for tasks such as statistical modeling, machine learning, or visualization.</p>
<p>The concept of tidy data plays a crucial role in this process. Tidy data principles advocate for a structure where each variable forms a column and each observation forms a row. This consistent structure facilitates easier and more effective data manipulation, analysis, and visualization. By adhering to these principles, you can ensure that your data is well-organized and suited to various analytical tasks.</p>
<p>In this post, we’ll dive into data transformation using the <code>tidyr</code> package in R, specifically focusing on the <code>pivot_longer()</code> and <code>pivot_wider()</code> functions. We’ll explore their theoretical background, use cases, and the importance of reshaping data in data science. Additionally, we’ll discuss when and why we should use wide or long formats, and analyze their advantages and disadvantages.</p>
</section>
<section id="why-data-transformation-is-essential" class="level2">
<h2 class="anchored" data-anchor-id="why-data-transformation-is-essential">Why Data Transformation is Essential</h2>
<p>In data science, structuring data appropriately can be the difference between smooth analysis and frustrating errors. Here’s why reshaping data matters:</p>
<ul>
<li><p><strong>Preparation for modeling</strong>: Many modeling and machine learning functions expect tidy input in which each observation occupies a single row, which often means reshaping the data to long format first.</p></li>
<li><p><strong>Improved visualization</strong>: Libraries like <code>ggplot2</code> in R are designed to work best with long data, allowing for more flexible and detailed plots.</p></li>
<li><p><strong>Data management and reporting</strong>: Certain summary statistics or reports are more intuitive when the data is presented in a wide format, making tables easier to interpret.</p></li>
</ul>
<p>Choosing the correct format can optimize both data handling and the clarity of your analysis.</p>
</section>
<section id="theoretical-overview" class="level2">
<h2 class="anchored" data-anchor-id="theoretical-overview">Theoretical Overview</h2>
<ul>
<li><p><strong><code>pivot_longer()</code></strong>: Converts wide-format data (where values of one variable are spread across several columns) into long format, where the former column names become the values of one column and their contents are stacked in another. This is particularly useful when you need to simplify your dataset for analysis or visualization.</p></li>
<li><p><strong><code>pivot_wider()</code></strong>: Converts long-format data (where categories are repeated across rows) into wide format by spreading the values of one column across several new columns. This is useful when data summarization or comparison across categories is required.</p></li>
</ul>
<p><strong>Function Arguments:</strong></p>
<ul>
<li><p><code>pivot_longer()</code>:</p>
<ul>
<li><p><code>data</code>: The dataset to be transformed.</p></li>
<li><p><code>cols</code>: Specifies the columns to pivot from wide to long.</p></li>
<li><p><code>names_to</code>: The name of the new column that will store the pivoted column names.</p></li>
<li><p><code>values_to</code>: The name of the new column that will store the pivoted values.</p></li>
<li><p><code>values_drop_na</code>: Drops rows where the pivoted value is <code>NA</code> if set to <code>TRUE</code>.</p></li>
</ul></li>
<li><p><code>pivot_wider()</code>:</p>
<ul>
<li><p><code>data</code>: The dataset to be transformed.</p></li>
<li><p><code>names_from</code>: Specifies which column’s values should become the column names in the wide format.</p></li>
<li><p><code>values_from</code>: The column that contains the values to fill into the new wide-format columns.</p></li>
<li><p><code>values_fill</code>: A value to fill missing entries when transforming to wide format.</p></li>
</ul></li>
</ul>
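<p>The arguments listed above can be sketched on a tiny dataset. The <code>scores_wide</code> data frame and all names in it are hypothetical, and filling with <code>0</code> is used purely to show what <code>values_fill</code> does.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tidyr)

# Hypothetical wide data with one missing value
scores_wide &lt;- data.frame(
  student = c("Ann", "Ben"),
  math    = c(90, NA),
  physics = c(85, 70)
)

# Wide to long; values_drop_na = TRUE removes Ben's missing math score
scores_long &lt;- pivot_longer(
  scores_wide,
  cols = c(math, physics),
  names_to = "subject",
  values_to = "score",
  values_drop_na = TRUE
)

# Long back to wide; values_fill supplies a value for the missing combination
scores_back &lt;- pivot_wider(
  scores_long,
  names_from = subject,
  values_from = score,
  values_fill = 0
)

scores_back</code></pre></div>
</div>
<p>In practice, a fill value of <code>0</code> is only appropriate when a missing combination really means zero; otherwise leaving the default <code>NA</code> is usually safer.</p>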
</section>
<section id="advantages-and-disadvantages-of-wide-vs.-long-formats" class="level2">
<h2 class="anchored" data-anchor-id="advantages-and-disadvantages-of-wide-vs.-long-formats">Advantages and Disadvantages of Wide vs.&nbsp;Long Formats</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 49%">
<col style="width: 50%">
</colgroup>
<thead>
<tr class="header">
<th><strong>Wide Format</strong></th>
<th><strong>Long Format</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Advantages</strong>: Easier to read for summary tables and simple reports. Can be more efficient for certain statistical summaries (e.g., total sales per month).</td>
<td><strong>Advantages</strong>: Ideal for detailed analysis and visualization (e.g., time series plots). Allows flexible data manipulation and easier grouping/summarization.</td>
</tr>
<tr class="even">
<td><strong>Disadvantages</strong>: Can become unwieldy with many variables or time points. Not suitable for machine learning or statistical models that expect long data.</td>
<td><strong>Disadvantages</strong>: Harder to interpret at a glance. May require more computational resources when handling large datasets.</td>
</tr>
</tbody>
</table>
<p><strong>When to Use Wide Format</strong>: Wide format is best for reporting, as it condenses information into fewer rows and is often more visually intuitive in summary tables.</p>
<p><strong>When to Use Long Format</strong>: Long format is essential for most analysis, particularly when working with time-series data, categorical data, or preparing data for machine learning algorithms.</p>
</section>
<section id="some-examples" class="level2">
<h2 class="anchored" data-anchor-id="some-examples">Some Examples</h2>
<section id="basic-data-transformation-using-pivot_longer" class="level3">
<h3 class="anchored" data-anchor-id="basic-data-transformation-using-pivot_longer">Basic Data Transformation Using <code>pivot_longer()</code></h3>
<p>Let’s start with a small monthly sales dataset:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyr)</span>
<span id="cb1-2">sales_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb1-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">product =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span>),</span>
<span id="cb1-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Jan =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>),</span>
<span id="cb1-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Feb =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">450</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">700</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">320</span>),</span>
<span id="cb1-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Mar =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">520</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">640</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">310</span>)</span>
<span id="cb1-7">)</span>
<span id="cb1-8">sales_data</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>  product Jan Feb Mar
1       A 500 450 520
2       B 600 700 640
3       C 300 320 310</code></pre>
</div>
</div>
<p>Using <code>pivot_longer()</code>, we convert it to a long format:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">sales_long <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(sales_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> Jan<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>Mar, </span>
<span id="cb3-2">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"month"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sales"</span>)</span>
<span id="cb3-3">sales_long</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 9 × 3
  product month sales
  &lt;chr&gt;   &lt;chr&gt; &lt;dbl&gt;
1 A       Jan     500
2 A       Feb     450
3 A       Mar     520
4 B       Jan     600
5 B       Feb     700
6 B       Mar     640
7 C       Jan     300
8 C       Feb     320
9 C       Mar     310</code></pre>
</div>
</div>
<p>This format is perfect for generating time-series visualizations, analyzing trends, or feeding the data into statistical models that expect a single observation per row.</p>
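<p>As a minimal sketch of that kind of analysis, the long table can be summarised per product with base R’s <code>aggregate()</code>. The name <code>mean_sales</code> is introduced here purely for illustration and is not part of the original example:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tidyr)

# Rebuild the monthly sales data so the snippet runs on its own
sales_data &lt;- data.frame(
  product = c("A", "B", "C"),
  Jan = c(500, 600, 300),
  Feb = c(450, 700, 320),
  Mar = c(520, 640, 310)
)
sales_long &lt;- pivot_longer(sales_data, cols = Jan:Mar,
                           names_to = "month", values_to = "sales")

# With one observation per row, a group-wise summary is a single call
mean_sales &lt;- aggregate(sales ~ product, data = sales_long, FUN = mean)
mean_sales</code></pre></div>
</div>
<p>Because every sale sits in its own row, the same call can group by <code>month</code> instead of <code>product</code> simply by changing the formula.</p>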
</section>
<section id="reshaping-data-with-pivot_wider" class="level3">
<h3 class="anchored" data-anchor-id="reshaping-data-with-pivot_wider">Reshaping Data with <code>pivot_wider()</code></h3>
<p>Now, let’s take the long-format data from the previous example and use <code>pivot_wider()</code> to convert it back to wide format:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">sales_wide <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_wider</span>(sales_long, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_from =</span> month, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_from =</span> sales)</span>
<span id="cb5-2">sales_wide</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 3 × 4
  product   Jan   Feb   Mar
  &lt;chr&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
1 A         500   450   520
2 B         600   700   640
3 C         300   320   310</code></pre>
</div>
</div>
<p>This wide format is easier to read when creating summary reports or comparison tables across months.</p>
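<p>To illustrate why the wide layout suits comparison tables, here is a small sketch that adds a per-product total across the month columns. The <code>total</code> column is a hypothetical addition for this example, and the long data is typed in directly so the snippet is self-contained:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tidyr)

# Long-format sales data, matching the values used above
sales_long &lt;- data.frame(
  product = rep(c("A", "B", "C"), each = 3),
  month   = rep(c("Jan", "Feb", "Mar"), times = 3),
  sales   = c(500, 450, 520, 600, 700, 640, 300, 320, 310)
)

# Back to wide format: one row per product, one column per month
sales_wide &lt;- pivot_wider(sales_long, names_from = month, values_from = sales)

# Row-wise total across the month columns (illustrative extra column)
sales_wide$total &lt;- rowSums(sales_wide[, c("Jan", "Feb", "Mar")])
sales_wide</code></pre></div>
</div>
<p>In wide format each month is its own column, so a row-wise operation such as <code>rowSums()</code> is enough to produce a report-style total.</p>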
</section>
<section id="handling-complex-data-with-missing-values" class="level3">
<h3 class="anchored" data-anchor-id="handling-complex-data-with-missing-values">Handling Complex Data with Missing Values</h3>
<p>Let’s extend the example to include regional sales data with missing values:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">sales_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb7-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">product =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span>),</span>
<span id="cb7-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">region =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"North"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"South"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"North"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"South"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"North"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"South"</span>),</span>
<span id="cb7-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Jan =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">580</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">350</span>),</span>
<span id="cb7-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Feb =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">450</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">490</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">700</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">320</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">400</span>)</span>
<span id="cb7-6">)</span>
<span id="cb7-7">sales_data</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>  product region Jan Feb
1       A  North 500 450
2       A  South  NA 490
3       B  North 600  NA
4       B  South 580 700
5       C  North 300 320
6       C  South 350 400</code></pre>
</div>
</div>
<p>Using <code>pivot_longer()</code>, we can transform this dataset while removing missing values:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">sales_long <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(sales_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> Jan<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>Feb, </span>
<span id="cb9-2">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"month"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sales"</span>, </span>
<span id="cb9-3">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_drop_na =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb9-4"></span>
<span id="cb9-5">sales_long</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 10 × 4
   product region month sales
   &lt;chr&gt;   &lt;chr&gt;  &lt;chr&gt; &lt;dbl&gt;
 1 A       North  Jan     500
 2 A       North  Feb     450
 3 A       South  Feb     490
 4 B       North  Jan     600
 5 B       South  Jan     580
 6 B       South  Feb     700
 7 C       North  Jan     300
 8 C       North  Feb     320
 9 C       South  Jan     350
10 C       South  Feb     400</code></pre>
</div>
</div>
<p>The missing values have been dropped, and the data is now in a form that can be analyzed by month, region, or product.</p>
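<p>If this long data is later reshaped back to wide format, the dropped combinations reappear as <code>NA</code>. The sketch below shows this and, as an alternative, uses the <code>values_fill</code> argument of <code>pivot_wider()</code> to supply an explicit value. Filling with <code>0</code> is only an assumption for illustration, since a missing report is not necessarily a sale of zero:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tidyr)

# Regional sales data with two missing values, as above
sales_data &lt;- data.frame(
  product = c("A", "A", "B", "B", "C", "C"),
  region  = c("North", "South", "North", "South", "North", "South"),
  Jan = c(500, NA, 600, 580, 300, 350),
  Feb = c(450, 490, NA, 700, 320, 400)
)
sales_long &lt;- pivot_longer(sales_data, cols = Jan:Feb,
                           names_to = "month", values_to = "sales",
                           values_drop_na = TRUE)

# Default behaviour: combinations that were dropped come back as NA
sales_wide_na &lt;- pivot_wider(sales_long, names_from = month,
                             values_from = sales)

# Illustrative assumption: treat a missing combination as 0
sales_wide_zero &lt;- pivot_wider(sales_long, names_from = month,
                               values_from = sales, values_fill = 0)
sales_wide_zero</code></pre></div>
</div>
<p>Whether to keep <code>NA</code> or fill a value depends on what a missing entry means in your data, so the choice is best made explicitly.</p>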
</section>
</section>
<section id="importance-of-data-transformation-in-visualization" class="level2">
<h2 class="anchored" data-anchor-id="importance-of-data-transformation-in-visualization">Importance of Data Transformation in Visualization</h2>
<p>One of the most significant advantages of transforming data into a long format is the ease of visualizing it. Visualization libraries like <code>ggplot2</code> in R often require data to be in long format for producing detailed and layered charts. For instance, the ability to map different variables to the aesthetics of a plot (such as color, size, or shape) is much simpler with long-format data.</p>
<p>Consider the example of monthly sales data. When the data is in wide format, plotting each product’s sales across months can be cumbersome and limited. However, converting the data into long format allows us to easily generate visualizations that compare sales trends across products and months.</p>
<p>Here’s an example bar plot illustrating the sales data in long format:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the required packages</span></span>
<span id="cb11-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyr)</span>
<span id="cb11-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb11-4"></span>
<span id="cb11-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create the dataset</span></span>
<span id="cb11-6">sales_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb11-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">product =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span>),</span>
<span id="cb11-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Jan =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>),</span>
<span id="cb11-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Feb =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">450</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">700</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">320</span>),</span>
<span id="cb11-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Mar =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">520</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">640</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">310</span>)</span>
<span id="cb11-11">)</span>
<span id="cb11-12"></span>
<span id="cb11-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Convert the data to long format</span></span>
<span id="cb11-14">sales_long <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(sales_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> Jan<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>Mar, </span>
<span id="cb11-15">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"month"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sales"</span>)</span>
<span id="cb11-16"></span>
<span id="cb11-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create the bar plot</span></span>
<span id="cb11-18"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(sales_long, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> month, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> sales, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> product)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-19">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_bar</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">stat =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"identity"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dodge"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-20">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sales Data: Long Format Example"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Month"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sales"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-21">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb11-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hjust =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>))</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-09-19_pivot/index_files/figure-html/unnamed-chunk-6-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<ul>
<li><p><strong><code>sales_data</code></strong>: A wide-format dataset containing the sales of products across different months.</p></li>
<li><p><strong><code>pivot_longer()</code></strong>: Used to transform data from a wide format to a long format.</p></li>
<li><p><strong><code>ggplot()</code></strong>: Used to create a bar plot. The <code>aes()</code> function specifies the axes and coloring (for different products).</p></li>
<li><p><strong><code>geom_bar()</code></strong>: Draws the bar plot.</p></li>
<li><p><strong><code>labs()</code></strong>: Adds titles and axis labels.</p></li>
<li><p><strong><code>theme_minimal()</code></strong>: Applies a minimal theme.</p></li>
<li><p><strong><code>position = "dodge"</code></strong>: Draws the bars for products side by side.</p></li>
</ul>
<p>The resulting plot illustrates how <code>pivot_longer()</code> supports better visualizations by organizing the data so that variables can be mapped directly to plot aesthetics.</p>
<p><strong>Why Visualization Matters</strong>:</p>
<ul>
<li><p><strong>Clear Insights</strong>: Long format allows better representation of complex relationships.</p></li>
<li><p><strong>Flexible Aesthetics</strong>: With long format data, you can map multiple variables to visual properties (like color or size) more easily.</p></li>
<li><p><strong>Layering Data</strong>: Especially in time-series or categorical data, layering information through visual channels becomes more efficient with long data.</p></li>
</ul>
<p>Without reshaping data, creating advanced visualizations for effective storytelling becomes challenging, making data transformation crucial in exploratory data analysis (EDA) and reporting.</p>
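<p>As a further sketch of this flexibility, the same long data can be drawn as a line chart with one line per product. Because <code>month</code> is a character column, ggplot2 orders it alphabetically on a discrete axis, so this example converts it to a factor with explicit levels first. The chronological level order and the object name <code>p</code> are assumptions made for illustration:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tidyr)
library(ggplot2)

sales_data &lt;- data.frame(
  product = c("A", "B", "C"),
  Jan = c(500, 600, 300),
  Feb = c(450, 700, 320),
  Mar = c(520, 640, 310)
)
sales_long &lt;- pivot_longer(sales_data, cols = Jan:Mar,
                           names_to = "month", values_to = "sales")

# Fix the month order explicitly; otherwise the axis is sorted alphabetically
sales_long$month &lt;- factor(sales_long$month, levels = c("Jan", "Feb", "Mar"))

# One line per product: color and group are both mapped to the product column
p &lt;- ggplot(sales_long, aes(x = month, y = sales,
                            color = product, group = product)) +
  geom_line() +
  geom_point() +
  labs(title = "Monthly Sales by Product", x = "Month", y = "Sales") +
  theme_minimal()
p</code></pre></div>
</div>
<p>Switching from bars to lines only required changing the geoms; the long-format data itself stayed the same.</p>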
</section>
<section id="importance-in-data-science" class="level2">
<h2 class="anchored" data-anchor-id="importance-in-data-science">Importance in Data Science</h2>
<p>In data science, the ability to reshape data is critical for exploratory data analysis (EDA), feature engineering, and model preparation. Many statistical models and machine learning algorithms expect data in long format, with each observation represented as a row. Converting between formats, especially in the cleaning and pre-processing phase, helps to avoid common errors in analysis, improves the quality of insights, and makes data manipulation more intuitive.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Alternatives <strong>to <code>pivot_longer()</code> and <code>pivot_wider()</code></strong>
</div>
</div>
<div class="callout-body-container callout-body">
<p>While <code>pivot_longer()</code> and <code>pivot_wider()</code> are part of the <code>tidyr</code> package and are widely used, there are alternative methods for reshaping data in R.</p>
<p>Historically, functions like <code>gather()</code> and <code>spread()</code> from the <code>tidyr</code> package were used for similar tasks before <code>pivot_longer()</code> and <code>pivot_wider()</code> became available. <code>gather()</code> was used to convert data from a wide format to a long format, while <code>spread()</code> was used to convert data from long to wide format. These functions laid the groundwork for the more flexible and consistent <code>pivot_longer()</code> and <code>pivot_wider()</code>.</p>
<p>Outside <code>tidyr</code>, the <code>reshape2</code> package offers <code>melt()</code> and <code>dcast()</code> as older but still functional alternatives. Base R also provides the <code>reshape()</code> function, which is more flexible but less intuitive than <code>pivot_longer()</code> and <code>pivot_wider()</code>.</p>
</div>
</div>
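<p>For comparison, the sketch below performs the same wide-to-long conversion with two of the alternatives mentioned in the tip: base R’s <code>reshape()</code> and the older <code>gather()</code> from <code>tidyr</code>. The object names <code>long_base</code> and <code>long_gather</code> are hypothetical and used only here:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(tidyr)

sales_data &lt;- data.frame(
  product = c("A", "B", "C"),
  Jan = c(500, 600, 300),
  Feb = c(450, 700, 320),
  Mar = c(520, 640, 310)
)

# Base R: the month columns are stacked into a single "sales" column
long_base &lt;- reshape(
  sales_data,
  direction = "long",
  varying   = list(c("Jan", "Feb", "Mar")),  # columns to stack
  v.names   = "sales",                       # name of the stacked values
  timevar   = "month",                       # name of the new key column
  times     = c("Jan", "Feb", "Mar"),        # key recorded for each column
  idvar     = "product"
)
rownames(long_base) &lt;- NULL  # drop the generated "A.Jan"-style row names

# Older tidyr interface, kept for existing code
long_gather &lt;- gather(sales_data, key = "month", value = "sales", Jan:Mar)

long_base</code></pre></div>
</div>
<p>Both results contain the same nine observations as the <code>pivot_longer()</code> output, although <code>reshape()</code> needs considerably more arguments to get there.</p>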
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Data transformation using <code>pivot_longer()</code> and <code>pivot_wider()</code> is fundamental in both everyday analysis and more advanced data science tasks. Choosing the correct data structure—whether wide or long—will optimize your workflow, whether you’re modeling, visualizing, or reporting.</p>
<p>The concept of tidy data, which emphasizes a consistent structure where each variable forms a column and each observation forms a row, is crucial in leveraging these functions effectively. By adhering to tidy data principles, you can ensure that your data is well-organized, making it easier to apply transformations and perform analyses. Through <code>pivot_longer()</code> and <code>pivot_wider()</code>, you gain flexibility in reshaping your data to meet the specific needs of your project, facilitating better data manipulation, visualization, and insight extraction.</p>
<p>Understanding when and why to use these transformations, alongside maintaining tidy data practices, will enhance your ability to work with complex datasets and produce meaningful results.</p>
</section>

 ]]></description>
  <category>R Programming</category>
  <category>tidyr</category>
  <category>pivot_wider</category>
  <category>pivot_longer</category>
  <category>data transformation</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-09-19_pivot/</guid>
  <pubDate>Thu, 19 Sep 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Text Data Analysis in R: Understanding grep, grepl, sub and gsub</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-07-09_text_analyze/</link>
  <description><![CDATA[ 






<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-07-09_text_analyze/text.png" class="img-fluid figure-img"></p>
<figcaption>Image source: https://carlalexander.ca/beginners-guide-regular-expressions/</figcaption>
</figure>
</div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>In text data analysis, being able to search for patterns, validate their existence, and perform substitutions is crucial. R provides powerful base functions like <code>grep</code>, <code>grepl</code>, <code>sub</code>, and <code>gsub</code> to handle these tasks efficiently. This blog post will delve into how these functions work, using examples ranging from simple to complex, to show how they can be leveraged for text manipulation, classification, and grouping tasks.</p>
</section>
<section id="understanding-grep-and-grepl" class="level2">
<h2 class="anchored" data-anchor-id="understanding-grep-and-grepl">1. Understanding <code>grep</code> and <code>grepl</code></h2>
<section id="what-is-grep" class="level3">
<h3 class="anchored" data-anchor-id="what-is-grep">What is <code>grep</code>?</h3>
<ul>
<li><p><strong>Functionality:</strong> Searches for matches to a specified pattern in a vector of character strings.</p></li>
<li><p><strong>Usage:</strong> <code>grep(pattern, x, ...)</code></p></li>
<li><p><strong>Example:</strong> Searching for specific words or patterns in text.</p></li>
</ul>
</section>
<section id="what-is-grepl" class="level3">
<h3 class="anchored" data-anchor-id="what-is-grepl">What is <code>grepl</code>?</h3>
<ul>
<li><p><strong>Functionality:</strong> Returns a logical vector indicating whether a pattern is found in each element of a character vector.</p></li>
<li><p><strong>Usage:</strong> <code>grepl(pattern, x, ...)</code></p></li>
<li><p><strong>Example:</strong> Checking if specific patterns exist in text data.</p></li>
</ul>
</section>
<section id="differences-advantages-and-disadvantages" class="level3">
<h3 class="anchored" data-anchor-id="differences-advantages-and-disadvantages">Differences, Advantages, and Disadvantages</h3>
<ul>
<li><p><strong>Differences:</strong> <code>grep</code> returns indices or values matching the pattern, while <code>grepl</code> returns a logical vector.</p></li>
<li><p><strong>Advantages:</strong> Fast pattern matching over large datasets.</p></li>
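<li><p><strong>Disadvantages:</strong> Patterns are interpreted as regular expressions by default, so special characters such as <code>.</code> or <code>(</code> must be escaped, or matched literally with <code>fixed = TRUE</code>; complex patterns can also become hard to read.</p></li>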
</ul>
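<p>A minimal sketch of the difference in return values, using a small hypothetical vector named <code>animals</code> rather than the dataset introduced later in this post:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical example vector
animals &lt;- c("Cats are great pets.",
             "Dogs are loyal animals.",
             "Birds can fly high.")

# grep() returns the positions of the matching elements
idx &lt;- grep("are", animals)
idx                                # 1 2

# With value = TRUE it returns the matching strings themselves
matched &lt;- grep("are", animals, value = TRUE)
matched

# grepl() returns one TRUE/FALSE per element
flags &lt;- grepl("are", animals)
flags                              # TRUE TRUE FALSE

# The logical vector can be used directly for subsetting
animals[flags]</code></pre></div>
</div>
<p>In practice, <code>grepl()</code> is convenient for filtering rows or building conditions, while <code>grep()</code> is useful when the positions of the matches are needed.</p>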
</section>
</section>
<section id="using-sub-and-gsub-for-text-substitution" class="level2">
<h2 class="anchored" data-anchor-id="using-sub-and-gsub-for-text-substitution">2. Using <code>sub</code> and <code>gsub</code> for Text Substitution</h2>
<section id="what-is-sub" class="level3">
<h3 class="anchored" data-anchor-id="what-is-sub">What is <code>sub</code>?</h3>
<ul>
<li><p><strong>Functionality:</strong> Replaces the first occurrence of a pattern in a string.</p></li>
<li><p><strong>Usage:</strong> <code>sub(pattern, replacement, x, ...)</code></p></li>
<li><p><strong>Example:</strong> Substituting specific patterns with another string.</p></li>
</ul>
</section>
<section id="what-is-gsub" class="level3">
<h3 class="anchored" data-anchor-id="what-is-gsub">What is <code>gsub</code>?</h3>
<ul>
<li><p><strong>Functionality:</strong> Replaces all occurrences of a pattern in a string.</p></li>
<li><p><strong>Usage:</strong> <code>gsub(pattern, replacement, x, ...)</code></p></li>
<li><p><strong>Example:</strong> Global substitution of patterns throughout text data.</p></li>
</ul>
</section>
<section id="differences-advantages-and-disadvantages-1" class="level3">
<h3 class="anchored" data-anchor-id="differences-advantages-and-disadvantages-1">Differences, Advantages, and Disadvantages</h3>
<ul>
<li><p><strong>Differences:</strong> <code>sub</code> replaces only the first occurrence, while <code>gsub</code> replaces all occurrences.</p></li>
<li><p><strong>Advantages:</strong> Efficient for bulk text replacements.</p></li>
<li><p><strong>Disadvantages:</strong> Fewer convenience features than dedicated string-processing packages, so complex replacements often require carefully written regular expressions.</p></li>
</ul>
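<p>The difference is easiest to see on a string that contains the pattern more than once. The sentence below is a hypothetical example, not part of the dataset used later:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical example string with a repeated word
x &lt;- "cats chase cats"

# sub() replaces only the first match
first_only &lt;- sub("cats", "dogs", x)
first_only                         # "dogs chase cats"

# gsub() replaces every match
all_matches &lt;- gsub("cats", "dogs", x)
all_matches                        # "dogs chase dogs"

# Both functions are vectorised over the input strings
gsub("cats", "dogs", c("cats", "no match here", "cats and cats"))</code></pre></div>
</div>
<p>Elements without a match are returned unchanged, so both functions can be applied safely to a whole column of text.</p>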
</section>
</section>
<section id="practical-examples-with-a-synthetic-dataset" class="level2">
<h2 class="anchored" data-anchor-id="practical-examples-with-a-synthetic-dataset">3. Practical Examples with a Synthetic Dataset</h2>
<section id="example-dataset" class="level3">
<h3 class="anchored" data-anchor-id="example-dataset">Example Dataset</h3>
<p>For the purposes of this blog post, we’ll create a synthetic dataset. This dataset is a data frame that contains two columns: <code>id</code> and <code>text</code>. Each row represents a unique text entry with a corresponding identifier.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a synthetic data frame</span></span>
<span id="cb1-2">text_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb1-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">id =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>,</span>
<span id="cb1-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">text =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cats are great pets."</span>,</span>
<span id="cb1-5">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Dogs are loyal animals."</span>,</span>
<span id="cb1-6">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Birds can fly high."</span>,</span>
<span id="cb1-7">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Fish swim in water."</span>,</span>
<span id="cb1-8">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Horses run fast."</span>,</span>
<span id="cb1-9">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Rabbits hop quickly."</span>,</span>
<span id="cb1-10">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cows give milk."</span>,</span>
<span id="cb1-11">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Sheep have wool."</span>,</span>
<span id="cb1-12">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Goats are curious creatures."</span>,</span>
<span id="cb1-13">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Lions are the kings of the jungle."</span>,</span>
<span id="cb1-14">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Tigers have stripes."</span>,</span>
<span id="cb1-15">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Elephants are large animals."</span>,</span>
<span id="cb1-16">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Monkeys are very playful."</span>,</span>
<span id="cb1-17">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Giraffes have long necks."</span>,</span>
<span id="cb1-18">           <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Zebras have black and white stripes."</span>)</span>
<span id="cb1-19">)</span></code></pre></div></div>
</div>
</section>
<section id="explanation-of-the-dataset" class="level3">
<h3 class="anchored" data-anchor-id="explanation-of-the-dataset">Explanation of the Dataset</h3>
<ul>
<li><p><strong><code>id</code> Column:</strong> This is a simple identifier for each row, ranging from 1 to 15.</p></li>
<li><p><strong><code>text</code> Column:</strong> This contains various sentences about different animals. Each text string is unique and describes a characteristic or trait of the animal mentioned.</p></li>
</ul>
</section>
<section id="applying-grep-grepl-sub-and-gsub" class="level3">
<h3 class="anchored" data-anchor-id="applying-grep-grepl-sub-and-gsub">Applying <code>grep</code>, <code>grepl</code>, <code>sub</code>, and <code>gsub</code></h3>
<section id="example-1-using-grep-to-find-specific-words" class="level4">
<h4 class="anchored" data-anchor-id="example-1-using-grep-to-find-specific-words">Example 1: Using <code>grep</code> to find specific words</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Find rows containing the word 'are'</span></span>
<span id="cb2-2">indices <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"are"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ignore.case =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb2-3">result_grep <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> text_data[indices, ]</span>
<span id="cb2-4">result_grep</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>   id                               text
1   1               Cats are great pets.
2   2            Dogs are loyal animals.
9   9       Goats are curious creatures.
10 10 Lions are the kings of the jungle.
12 12       Elephants are large animals.
13 13          Monkeys are very playful.</code></pre>
</div>
</div>
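<p><strong>Explanation:</strong> <code>grep("are", text_data$text, ignore.case = TRUE)</code> searches for the pattern “are” in the <code>text</code> column of <code>text_data</code>, ignoring case, and returns the indices of the matching rows, which are then used to subset the data frame. The pattern is matched as a substring, so it would also match a word such as “care”; in this dataset it only matches the word “are”.</p>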
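<p>If only the whole word should match, one option is to wrap the pattern in word boundaries (<code>\\b</code>). The sketch below uses a small hypothetical vector, <code>words</code>, to show the difference:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical example vector; the second sentence is not in the dataset
words &lt;- c("Cats are great pets.", "Take care of your pets.")

# Plain pattern: matches "are" anywhere, including inside "care"
substring_hits &lt;- grep("are", words)

# \\b marks a word boundary, so only the standalone word matches
word_hits &lt;- grep("\\bare\\b", words)

substring_hits  # 1 2
word_hits       # 1</code></pre></div>
</div>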
</section>
<section id="example-2-applying-grepl-for-conditional-checks" class="level4">
<h4 class="anchored" data-anchor-id="example-2-applying-grepl-for-conditional-checks">Example 2: Applying <code>grepl</code> for conditional checks</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add a new column indicating if the word 'fly' is present</span></span>
<span id="cb4-2"></span>
<span id="cb4-3">text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>contains_fly <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grepl</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fly"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb4-4">text_data</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>   id                                 text contains_fly
1   1                 Cats are great pets.        FALSE
2   2              Dogs are loyal animals.        FALSE
3   3                  Birds can fly high.         TRUE
4   4                  Fish swim in water.        FALSE
5   5                     Horses run fast.        FALSE
6   6                 Rabbits hop quickly.        FALSE
7   7                      Cows give milk.        FALSE
8   8                     Sheep have wool.        FALSE
9   9         Goats are curious creatures.        FALSE
10 10   Lions are the kings of the jungle.        FALSE
11 11                 Tigers have stripes.        FALSE
12 12         Elephants are large animals.        FALSE
13 13            Monkeys are very playful.        FALSE
14 14            Giraffes have long necks.        FALSE
15 15 Zebras have black and white stripes.        FALSE</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>grepl("fly", text_data$text)</code> checks each element of the <code>text</code> column for the presence of the word “fly” and returns a logical vector. This vector is then added as a new column <code>contains_fly</code>.</p>
</section>
<section id="example-3-using-sub-to-replace-a-pattern-in-text" class="level4">
<h4 class="anchored" data-anchor-id="example-3-using-sub-to-replace-a-pattern-in-text">Example 3: Using <code>sub</code> to replace a pattern in text</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace the first occurrence of ' a ' (with surrounding spaces) with ' A ' in the text column</span></span>
<span id="cb6-2"></span>
<span id="cb6-3">text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text_sub <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" a "</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" A "</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb6-4">text_data[,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text_sub"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>                                   text                             text_sub
1                  Cats are great pets.                 Cats are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.</code></pre>
</div>
</div>
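<p><strong>Explanation:</strong> <code>sub(" a ", " A ", text_data$text)</code> replaces the first occurrence of <code>" a "</code> (the letter “a” with a space on each side) with <code>" A "</code> in each element of the <code>text</code> column and stores the result in a new column, <code>text_sub</code>. None of the sentences in this dataset contains a standalone “a”, so <code>text_sub</code> is identical to <code>text</code> here.</p>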
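<p>To see <code>sub</code> actually change something, a variant without the surrounding spaces can be tried on two sentences copied from the dataset. This is only an illustrative sketch; <code>sample_text</code> and <code>first_a</code> are names introduced here:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Two sentences copied from the dataset
sample_text &lt;- c("Cats are great pets.",
                 "Lions are the kings of the jungle.")

# Pattern "a" (no spaces): only the first "a" in each string is replaced
first_a &lt;- sub("a", "A", sample_text)

first_a
# "CAts are great pets."  "Lions Are the kings of the jungle."</code></pre></div>
</div>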
</section>
<section id="example-4-applying-gsub-for-global-pattern-replacement" class="level4">
<h4 class="anchored" data-anchor-id="example-4-applying-gsub-for-global-pattern-replacement">Example 4: Applying <code>gsub</code> for global pattern replacement</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace all occurrences of ' a ' (with surrounding spaces) with ' A ' in the text column</span></span>
<span id="cb8-2"></span>
<span id="cb8-3">text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text_gsub <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" a "</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" A "</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb8-4">text_data[,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text_gsub"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>                                   text                            text_gsub
1                  Cats are great pets.                 Cats are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.</code></pre>
</div>
</div>
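<p><strong>Explanation:</strong> <code>gsub(" a ", " A ", text_data$text)</code> replaces all occurrences of <code>" a "</code> (the letter “a” with a space on each side) with <code>" A "</code> in each element of the <code>text</code> column and stores the result in a new column, <code>text_gsub</code>. As in the previous example, no sentence contains a standalone “a”, so <code>text_gsub</code> is identical to <code>text</code>.</p>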
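<p>For a visible effect, the same call can be made with the pattern <code>"a"</code> (no spaces). The sketch below uses two sentences copied from the dataset; <code>sample_text</code> and <code>all_a</code> are names introduced here for illustration:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Two sentences copied from the dataset
sample_text &lt;- c("Cats are great pets.",
                 "Elephants are large animals.")

# Pattern "a" (no spaces): every "a" in each string is replaced
all_a &lt;- gsub("a", "A", sample_text)

all_a
# "CAts Are greAt pets."  "ElephAnts Are lArge AnimAls."</code></pre></div>
</div>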
</section>
</section>
<section id="example-5-text-based-grouping-and-assignment" class="level3">
<h3 class="anchored" data-anchor-id="example-5-text-based-grouping-and-assignment">Example 5: Text-based Grouping and Assignment</h3>
<p>Let’s group the texts based on the presence of the word “fly” and assign a category.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add a new column 'category' based on the presence of the word 'fly'</span></span>
<span id="cb10-2"></span>
<span id="cb10-3">text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>category <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grepl</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"fly"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ignore.case =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Can Fly"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Cannot Fly"</span>)</span>
<span id="cb10-4">text_data[,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"category"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>                                   text   category
1                  Cats are great pets. Cannot Fly
2               Dogs are loyal animals. Cannot Fly
3                   Birds can fly high.    Can Fly
4                   Fish swim in water. Cannot Fly
5                      Horses run fast. Cannot Fly
6                  Rabbits hop quickly. Cannot Fly
7                       Cows give milk. Cannot Fly
8                      Sheep have wool. Cannot Fly
9          Goats are curious creatures. Cannot Fly
10   Lions are the kings of the jungle. Cannot Fly
11                 Tigers have stripes. Cannot Fly
12         Elephants are large animals. Cannot Fly
13            Monkeys are very playful. Cannot Fly
14            Giraffes have long necks. Cannot Fly
15 Zebras have black and white stripes. Cannot Fly</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>grepl("fly", text_data$text, ignore.case = TRUE)</code> checks for the presence of the word “fly” in each element of the <code>text</code> column, ignoring case. The <code>ifelse</code> function is then used to create a new column <code>category</code>, assigning “Can Fly” if the word is present and “Cannot Fly” otherwise.</p>
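<p>The same idea extends to more than two groups by nesting <code>ifelse</code> calls. The following is a rough sketch on three sentences from the dataset; the labels <code>"Flies"</code>, <code>"Swims"</code>, and <code>"Other"</code> are made up for illustration:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Three sentences copied from the dataset
sample_text &lt;- c("Birds can fly high.",
                 "Fish swim in water.",
                 "Horses run fast.")

# Check patterns in order; the first one that matches decides the label
movement &lt;- ifelse(grepl("fly", sample_text), "Flies",
            ifelse(grepl("swim", sample_text), "Swims", "Other"))

movement  # "Flies" "Swims" "Other"</code></pre></div>
</div>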
</section>
<section id="additional-examples" class="level3">
<h3 class="anchored" data-anchor-id="additional-examples">Additional Examples</h3>
<section id="example-6-using-grep-to-find-multiple-patterns" class="level4">
<h4 class="anchored" data-anchor-id="example-6-using-grep-to-find-multiple-patterns">Example 6: Using <code>grep</code> to find multiple patterns</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Find rows containing the words 'great' or 'loyal'</span></span>
<span id="cb12-2">indices <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"great|loyal"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ignore.case =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb12-3">text_data[indices,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>) ]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Cats are great pets."    "Dogs are loyal animals."</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>grep("great|loyal", text_data$text, ignore.case = TRUE)</code> searches for the words “great” or “loyal” in the <code>text</code> column, ignoring case, and returns the indices of the matching rows. Because only the <code>text</code> column is selected, the matching sentences are returned as a character vector rather than as a data frame.</p>
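<p>When only the matching strings are needed, <code>grep</code> can return them directly through its <code>value = TRUE</code> argument, skipping the indexing step. A small sketch on three sentences from the dataset (<code>sample_text</code> and <code>matches</code> are illustrative names):</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Three sentences copied from the dataset
sample_text &lt;- c("Cats are great pets.",
                 "Dogs are loyal animals.",
                 "Birds can fly high.")

# value = TRUE returns the matching elements instead of their positions
matches &lt;- grep("great|loyal", sample_text, ignore.case = TRUE, value = TRUE)

matches  # "Cats are great pets." "Dogs are loyal animals."</code></pre></div>
</div>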
</section>
<section id="example-7-using-gsub-for-complex-substitutions" class="level4">
<h4 class="anchored" data-anchor-id="example-7-using-gsub-for-complex-substitutions">Example 7: Using <code>gsub</code> for complex substitutions</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace all occurrences of 'animals' with 'creatures' and 'pets' with 'companions'</span></span>
<span id="cb14-2"></span>
<span id="cb14-3">text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text_gsub_complex <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"animals"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"creatures"</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pets"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"companions"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text))</span>
<span id="cb14-4">text_data[,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text_gsub_complex"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>                                   text                    text_gsub_complex
1                  Cats are great pets.           Cats are great companions.
2               Dogs are loyal animals.            Dogs are loyal creatures.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                      Cows give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.       Elephants are large creatures.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> The inner <code>gsub</code> replaces all occurrences of ‘pets’ with ‘companions’, and the outer <code>gsub</code> replaces all occurrences of ‘animals’ with ‘creatures’ in each element of the <code>text</code> column. The resulting text is stored in a new column <code>text_gsub_complex</code>.</p>
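<p>Nested calls become hard to read as the number of substitutions grows. One possible alternative, sketched here under the assumption that the patterns do not interfere with one another, is to keep the pairs in a named vector and apply them in a loop (<code>replacements</code> and <code>cleaned</code> are names introduced for this example):</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Two sentences copied from the dataset
cleaned &lt;- c("Cats are great pets.", "Dogs are loyal animals.")

# Names are the patterns, values are the replacements
replacements &lt;- c(pets = "companions", animals = "creatures")

# Apply each substitution in turn to the running result
for (pattern in names(replacements)) {
  cleaned &lt;- gsub(pattern, replacements[[pattern]], cleaned)
}

cleaned  # "Cats are great companions." "Dogs are loyal creatures."</code></pre></div>
</div>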
</section>
<section id="example-8-using-grepl-with-multiple-conditions" class="level4">
<h4 class="anchored" data-anchor-id="example-8-using-grepl-with-multiple-conditions">Example 8: Using <code>grepl</code> with multiple conditions</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add a new column indicating if the text contains either 'large' or 'playful'</span></span>
<span id="cb16-2"></span>
<span id="cb16-3">text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>contains_large_or_playful <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grepl</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"large|playful"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb16-4">text_data[,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"contains_large_or_playful"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>                                   text contains_large_or_playful
1                  Cats are great pets.                     FALSE
2               Dogs are loyal animals.                     FALSE
3                   Birds can fly high.                     FALSE
4                   Fish swim in water.                     FALSE
5                      Horses run fast.                     FALSE
6                  Rabbits hop quickly.                     FALSE
7                       Cows give milk.                     FALSE
8                      Sheep have wool.                     FALSE
9          Goats are curious creatures.                     FALSE
10   Lions are the kings of the jungle.                     FALSE
11                 Tigers have stripes.                     FALSE
12         Elephants are large animals.                      TRUE
13            Monkeys are very playful.                      TRUE
14            Giraffes have long necks.                     FALSE
15 Zebras have black and white stripes.                     FALSE</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>grepl("large|playful", text_data$text)</code> checks each element of the <code>text</code> column for the presence of the words “large” or “playful” and returns a logical vector. This vector is then added as a new column <code>contains_large_or_playful</code>.</p>
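<p>The <code>|</code> operator expresses “either”. To require that several patterns are all present, one simple approach is to combine separate <code>grepl</code> results with <code>&amp;</code>, as in this sketch on three sentences from the dataset (<code>both</code> is an illustrative name):</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Three sentences copied from the dataset
sample_text &lt;- c("Dogs are loyal animals.",
                 "Elephants are large animals.",
                 "Monkeys are very playful.")

# TRUE only where both patterns are found in the same string
both &lt;- grepl("large", sample_text) &amp; grepl("animals", sample_text)

both  # FALSE TRUE FALSE</code></pre></div>
</div>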
</section>
</section>
</section>
<section id="understanding-regular-expressions" class="level2">
<h2 class="anchored" data-anchor-id="understanding-regular-expressions">4. Understanding Regular Expressions</h2>
<p>Regular expressions (regex) are powerful tools used for pattern matching and text manipulation. They allow you to define complex search patterns using a combination of literal characters and special symbols. R’s <code>grep</code>, <code>grepl</code>, <code>sub</code>, and <code>gsub</code> functions all support the use of regular expressions.</p>
<section id="key-components-of-regular-expressions" class="level3">
<h3 class="anchored" data-anchor-id="key-components-of-regular-expressions">Key Components of Regular Expressions</h3>
<ul>
<li><p><strong>Literal Characters:</strong> These are the basic building blocks of regex. For example, <code>cat</code> matches the string “cat”.</p></li>
<li><p><strong>Metacharacters:</strong> Special characters with unique meanings, such as <code>^</code>, <code>$</code>, <code>.</code>, <code>*</code>, <code>+</code>, <code>?</code>, <code>|</code>, <code>[]</code>, <code>()</code>, <code>{}</code></p>
<ul>
<li><p><code>^</code> matches the start of a string.</p></li>
<li><p><code>$</code> matches the end of a string.</p></li>
<li><p><code>.</code> matches any single character (with <code>perl = TRUE</code>, any character except a newline).</p></li>
<li><p><code>*</code> matches zero or more occurrences of the preceding element.</p></li>
<li><p><code>+</code> matches one or more occurrences of the preceding element.</p></li>
<li><p><code>?</code> matches zero or one occurrence of the preceding element.</p></li>
<li><p><code>|</code> denotes alternation (or).</p></li>
<li><p><code>[]</code> matches any one of the characters inside the brackets.</p></li>
<li><p><code>()</code> groups elements together.</p></li>
<li><p><code>{}</code> specifies how many times the preceding element must occur, either exactly (<code>a{2}</code>) or within a range (<code>a{2,3}</code>).</p></li>
</ul></li>
</ul>
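<p>The metacharacters above can be tried out on a tiny vector. The following sketch uses made-up strings (<code>x</code> is not part of the dataset) and shows which elements each pattern matches:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Made-up strings for experimenting with metacharacters
x &lt;- c("cat", "cot", "coat", "ct", "cats")

# "." stands for exactly one character between "c" and "t"
dot_match   &lt;- grepl("^c.t$", x)          # TRUE  TRUE  FALSE FALSE FALSE

# "*" allows zero or more "o" characters
star_match  &lt;- grepl("^co*t$", x)         # FALSE TRUE  FALSE TRUE  FALSE

# "[]" picks one character from a set; no "$", so "cats" also matches
set_match   &lt;- grepl("^c[ao]t", x)        # TRUE  TRUE  FALSE FALSE TRUE

# "()" groups alternatives, "|" separates them, "?" makes "s" optional
group_match &lt;- grepl("^c(a|oa)ts?$", x)   # TRUE  FALSE TRUE  FALSE TRUE

# "{2}" requires exactly two lowercase letters between "c" and "t"
count_match &lt;- grepl("^c[a-z]{2}t$", x)   # FALSE FALSE TRUE  FALSE FALSE</code></pre></div>
</div>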
</section>
<section id="examples-with-regular-expressions" class="level3">
<h3 class="anchored" data-anchor-id="examples-with-regular-expressions">Examples with Regular Expressions</h3>
<p>Using the same synthetic dataset, let’s explore how to apply regular expressions with <code>grep</code>, <code>grepl</code>, <code>sub</code>, and <code>gsub</code>.</p>
<section id="example-1-matching-text-that-starts-with-a-specific-word" class="level4">
<h4 class="anchored" data-anchor-id="example-1-matching-text-that-starts-with-a-specific-word">Example 1: Matching Text that Starts with a Specific Word</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Find rows where text starts with the word 'Cats'</span></span>
<span id="cb18-2">indices <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"^Cats"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb18-3">text_data[indices,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Cats are great pets."</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>grep("^Cats", text_data$text)</code> uses the <code>^</code> metacharacter to find rows where the text starts with “Cats”.</p>
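<p>For a fixed prefix like this one, base R also provides <code>startsWith</code>, which does not interpret its argument as a regular expression. The sketch below assumes R 3.3.0 or later and uses two sentences from the dataset (<code>sample_text</code> is an illustrative name):</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Two sentences copied from the dataset
sample_text &lt;- c("Cats are great pets.", "Dogs are loyal animals.")

# Regex version: "^" anchors the match at the start of the string
regex_prefix &lt;- grepl("^Cats", sample_text)

# Literal version: no regex involved
plain_prefix &lt;- startsWith(sample_text, "Cats")

regex_prefix  # TRUE FALSE
plain_prefix  # TRUE FALSE</code></pre></div>
</div>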
</section>
<section id="example-2-matching-text-that-ends-with-a-specific-word" class="level4">
<h4 class="anchored" data-anchor-id="example-2-matching-text-that-ends-with-a-specific-word">Example 2: Matching Text that Ends with a Specific Word</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb20-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Find rows where text ends with the word 'water.'</span></span>
<span id="cb20-2">indices <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"water</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">.$"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb20-3">text_data[indices,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Fish swim in water."</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>grep("water\\.$", text_data$text)</code> uses the <code>$</code> metacharacter to find rows where the text ends with “water.” The <code>\\.</code> is used to escape the dot character, which is a metacharacter in regex.</p>
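<p>To see why the escape matters, the minimal sketch below compares the unescaped and escaped patterns on a small made-up vector (<code>x</code> is a hypothetical example, not part of <code>text_data</code>). Without the escape, <code>.</code> matches any single character, so a string ending in “waters” also matches.</p>

```r
# Hypothetical example vector (not part of text_data)
x <- c("Fish swim in water.", "Fish swim in waters")

# Unescaped dot: matches any single character before the end of the string
unescaped <- grepl("water.$", x)

# Escaped dot: matches only a literal period before the end of the string
escaped <- grepl("water\\.$", x)

print(unescaped)  # TRUE TRUE
print(escaped)    # TRUE FALSE
```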
</section>
<section id="example-3-matching-text-that-contains-a-specific-pattern" class="level4">
<h4 class="anchored" data-anchor-id="example-3-matching-text-that-contains-a-specific-pattern">Example 3: Matching Text that Contains a Specific Pattern</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb22-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Find rows where text contains 'great' followed by any character and 'pets'</span></span>
<span id="cb22-2">indices <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"great.pets"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb22-3">text_data[indices,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Cats are great pets."</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>grep("great.pets", text_data$text)</code> uses the <code>.</code> metacharacter, which matches any single character, to find rows where exactly one character (here, a space) sits between “great” and “pets”.</p>
</section>
</section>
<section id="example-4-using-gsub-with-regular-expressions" class="level3">
<h3 class="anchored" data-anchor-id="example-4-using-gsub-with-regular-expressions">Example 4: Using <code>gsub</code> with Regular Expressions</h3>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb24-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replace all occurrences of words starting with 'C' with 'Animal'</span></span>
<span id="cb24-2">text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text_gsub_regex <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gsub</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">bC</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">w+"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Animal"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb24-3">text_data[,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text_gsub_regex"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>                                   text                      text_gsub_regex
1                  Cats are great pets.               Animal are great pets.
2               Dogs are loyal animals.              Dogs are loyal animals.
3                   Birds can fly high.                  Birds can fly high.
4                   Fish swim in water.                  Fish swim in water.
5                      Horses run fast.                     Horses run fast.
6                  Rabbits hop quickly.                 Rabbits hop quickly.
7                       Cows give milk.                    Animal give milk.
8                      Sheep have wool.                     Sheep have wool.
9          Goats are curious creatures.         Goats are curious creatures.
10   Lions are the kings of the jungle.   Lions are the kings of the jungle.
11                 Tigers have stripes.                 Tigers have stripes.
12         Elephants are large animals.         Elephants are large animals.
13            Monkeys are very playful.            Monkeys are very playful.
14            Giraffes have long necks.            Giraffes have long necks.
15 Zebras have black and white stripes. Zebras have black and white stripes.</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>gsub("\\bC\\w+", "Animal", text_data$text)</code> replaces all words starting with ‘C’ (<code>\\b</code> indicates a word boundary, <code>C</code> matches the character ‘C’, and <code>\\w+</code> matches one or more word characters) with “Animal”.</p>
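<p>The role of the word boundary can be illustrated with a small sketch on a made-up string (<code>x</code> is a hypothetical example, not part of <code>text_data</code>). Note also that the pattern is case-sensitive, so a lowercase “cats” would not be replaced.</p>

```r
# Hypothetical example string (not part of text_data)
x <- "McCoy likes Cats"

# Without the word boundary, the 'C' inside "McCoy" also starts a match
no_boundary <- gsub("C\\w+", "Animal", x)

# With the word boundary, only words that begin with 'C' are replaced
with_boundary <- gsub("\\bC\\w+", "Animal", x)

print(no_boundary)    # "McAnimal likes Animal"
print(with_boundary)  # "McCoy likes Animal"
```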
<section id="example-5-using-grepl-to-check-for-complex-patterns" class="level4">
<h4 class="anchored" data-anchor-id="example-5-using-grepl-to-check-for-complex-patterns">Example 5: Using <code>grepl</code> to Check for Complex Patterns</h4>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb26-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add a new column indicating if the text contains a word ending with 's'</span></span>
<span id="cb26-2">text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>contains_s_end <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grepl</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">b</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">w+s</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">b"</span>, text_data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>text)</span>
<span id="cb26-3">text_data[,<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"text"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"contains_s_end"</span>)]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>                                   text contains_s_end
1                  Cats are great pets.           TRUE
2               Dogs are loyal animals.           TRUE
3                   Birds can fly high.           TRUE
4                   Fish swim in water.          FALSE
5                      Horses run fast.           TRUE
6                  Rabbits hop quickly.           TRUE
7                       Cows give milk.           TRUE
8                      Sheep have wool.          FALSE
9          Goats are curious creatures.           TRUE
10   Lions are the kings of the jungle.           TRUE
11                 Tigers have stripes.           TRUE
12         Elephants are large animals.           TRUE
13            Monkeys are very playful.           TRUE
14            Giraffes have long necks.           TRUE
15 Zebras have black and white stripes.           TRUE</code></pre>
</div>
</div>
<p><strong>Explanation:</strong> <code>grepl("\\b\\w+s\\b", text_data$text)</code> checks each element of the <code>text</code> column for the presence of a word ending with ‘s’. Here, <code>\\b</code> marks a word boundary on each side of the word, <code>\\w+</code> matches one or more word characters, and <code>s</code> followed by the closing <code>\\b</code> requires the word to end with the character ‘s’.</p>
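<p><code>grepl</code> only reports whether a match exists. If you also want to see which words matched, one possible approach (a sketch, not part of the original workflow) combines base R’s <code>gregexpr()</code> and <code>regmatches()</code>. The vector <code>x</code> below is a hypothetical example that reuses two sentences from the data.</p>

```r
# Hypothetical example vector (two sentences borrowed from the data above)
x <- c("Cats are great pets.", "Fish swim in water.")

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(x, gregexpr("\\b\\w+s\\b", x))

print(matches)
```

<p>The first sentence yields “Cats” and “pets”, while the second yields an empty character vector, which is consistent with the <code>TRUE</code> and <code>FALSE</code> values in the table above.</p>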
</section>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>The <code>grep</code>, <code>grepl</code>, <code>sub</code>, and <code>gsub</code> functions in R are powerful tools for text data analysis. They allow for efficient searching, pattern matching, and text manipulation, making them essential for any data analyst or data scientist working with textual data. By understanding how to use these functions and leveraging regular expressions, you can perform a wide range of text processing tasks, from simple searches to complex pattern replacements and text-based classifications.</p>


</section>

 ]]></description>
  <category>R Programming</category>
  <category>grep</category>
  <category>grepl</category>
  <category>sub</category>
  <category>gsub</category>
  <category>regex</category>
  <category>text analysis</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-07-09_text_analyze/</guid>
  <pubDate>Tue, 09 Jul 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Exploring apply, sapply, lapply, and map Functions in R</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-04-15_apply_map/</link>
  <description><![CDATA[ 






<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction"><strong>Introduction</strong></h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://www.tumblr.com/jake-clark/100946716432?source=share"><img src="https://mfatihtuzen.github.io/posts/2024-04-15_apply_map/apply_map.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<p>In R programming, the apply family of functions (<strong><code>apply()</code></strong>, <strong><code>sapply()</code></strong>, <strong><code>lapply()</code></strong>) and the <strong><code>map()</code></strong> function from the purrr package are powerful tools for data manipulation and analysis. This guide covers the syntax and usage of each function with examples, including built-in functions, additional arguments, and performance benchmarking.</p>
</section>
<section id="understanding-apply-function" class="level2">
<h2 class="anchored" data-anchor-id="understanding-apply-function">Understanding apply() Function</h2>
<p>The <code>apply()</code> function in R is used to apply a specified function to the rows or columns of an array. Its syntax is as follows:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">apply</span>(X, MARGIN, FUN, ...)</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>X</code></strong>: The input data, typically an array or matrix.</p></li>
<li><p><strong><code>MARGIN</code></strong>: A numeric vector giving the dimensions over which the function is applied. Use <strong><code>1</code></strong> for rows, <strong><code>2</code></strong> for columns, and <strong><code>c(1, 2)</code></strong> for both rows and columns.</p></li>
<li><p><strong><code>FUN</code></strong>: The function to apply.</p></li>
<li><p><strong><code>...</code></strong>: Additional arguments to be passed to the function.</p></li>
</ul>
<p>Let’s calculate the mean of each row in a matrix using <strong><code>apply()</code></strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">matrix_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb2-2">row_means <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">apply</span>(matrix_data, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, mean)</span>
<span id="cb2-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(row_means)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 4 5 6</code></pre>
</div>
</div>
<p>This example computes the mean of each row in the matrix.</p>
<p>Let’s calculate the standard deviation of each column in a matrix and pass an additional argument (<strong><code>na.rm = TRUE</code></strong>) through <strong><code>apply()</code></strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">column_stdev <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">apply</span>(matrix_data, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, sd, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb4-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(column_stdev)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 1 1 1</code></pre>
</div>
</div>
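<p>Because <code>matrix_data</code> contains no missing values, <code>na.rm = TRUE</code> does not change the result above. The sketch below uses a hypothetical matrix <code>m</code> with one <code>NA</code> to show what the argument does when it is forwarded to <code>sd()</code>.</p>

```r
# Hypothetical matrix with one missing value in the first column
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 3)

# Without na.rm, the NA propagates to the result for column 1
sd_default <- apply(m, 2, sd)

# na.rm = TRUE is forwarded to sd() through the ... argument
sd_na_rm <- apply(m, 2, sd, na.rm = TRUE)

print(sd_default)  # NA 1
print(sd_na_rm)    # 0.7071068 1.0000000
```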
</section>
<section id="understanding-sapply-function" class="level2">
<h2 class="anchored" data-anchor-id="understanding-sapply-function">Understanding sapply() Function</h2>
<p>The <strong><code>sapply()</code></strong> function is a wrapper around <strong><code>lapply()</code></strong> that simplifies the result to a vector or matrix when possible and otherwise returns a list. Its syntax is similar to <strong><code>lapply()</code></strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sapply</span>(X, FUN, ...)</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>X</code></strong>: The input data, typically a list.</p></li>
<li><p><strong><code>FUN</code></strong>: The function to apply.</p></li>
<li><p><strong><code>...</code></strong>: Additional arguments to be passed to the function.</p></li>
</ul>
<p>Let’s calculate the sum of each element in a list using <strong><code>sapply()</code></strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">num_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">a =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">b =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">c =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>)</span>
<span id="cb7-2">sum_results <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sapply</span>(num_list, sum)</span>
<span id="cb7-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(sum_results)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code> a  b  c 
 6 15 24 </code></pre>
</div>
</div>
<p>This example computes the sum of each element in the list.</p>
<p>Let’s convert each element in a list to uppercase using <strong><code>sapply()</code></strong> and the <strong><code>toupper()</code></strong> function:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">text_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hello"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"world"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"R"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"programming"</span>)</span>
<span id="cb9-2">uppercase_text <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sapply</span>(text_list, toupper)</span>
<span id="cb9-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(uppercase_text)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "HELLO"       "WORLD"       "R"           "PROGRAMMING"</code></pre>
</div>
</div>
<p>Here, <strong><code>sapply()</code></strong> applies the <strong><code>toupper()</code></strong> function to each element in the list, converting them to uppercase.</p>
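<p>The shape of the value returned by <code>sapply()</code> depends on what the function returns for each element. The following sketch, using small hypothetical inputs, shows one case that simplifies to a matrix and one that cannot be simplified and therefore stays a list.</p>

```r
# Every call returns a length-2 vector, so sapply() simplifies to a matrix
range_matrix <- sapply(list(a = 1:3, b = 4:6), range)

# The results have different lengths, so sapply() falls back to a list
doubled <- sapply(list(1:2, 1:3), function(x) x * 2)

print(is.matrix(range_matrix))  # TRUE
print(is.list(doubled))         # TRUE
```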
</section>
<section id="understanding-lapply-function" class="level2">
<h2 class="anchored" data-anchor-id="understanding-lapply-function">Understanding lapply() Function</h2>
<p>The <strong><code>lapply()</code></strong> function applies a function to each element of a list and returns a list. Its syntax is as follows:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(X, FUN, ...)</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>X</code></strong>: The input data, typically a list.</p></li>
<li><p><strong><code>FUN</code></strong>: The function to apply.</p></li>
<li><p><strong><code>...</code></strong>: Additional arguments to be passed to the function.</p></li>
</ul>
<p>Let’s apply a custom function to each element of a list using <strong><code>lapply()</code></strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1">num_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">a =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">b =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">c =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>)</span>
<span id="cb12-2">custom_function <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(x) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb12-3">result_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(num_list, custom_function)</span>
<span id="cb12-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(result_list)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>$a
[1] 12

$b
[1] 30

$c
[1] 48</code></pre>
</div>
</div>
<p>In this example, <strong><code>lapply()</code></strong> applies the custom function to each element in the list.</p>
<p>Let’s extract the vowels from each element in a list of words using <strong><code>lapply()</code></strong> and a custom function:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1">word_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"apple"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"banana"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"orange"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"grape"</span>)</span>
<span id="cb14-2">vowel_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(word_list, <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(word) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">grep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[aeiou]"</span>, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">strsplit</span>(word, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]], <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>))</span>
<span id="cb14-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(vowel_list)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[[1]]
[1] "a" "e"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "o" "a" "e"

[[4]]
[1] "a" "e"</code></pre>
</div>
</div>
<p>Here, <strong><code>lapply()</code></strong> applies the custom function to each element in the list, extracting vowels from words.</p>
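<p>As with <code>apply()</code>, extra arguments given after <code>FUN</code> are passed on to the function through <code>...</code>. A minimal sketch, assuming a hypothetical list <code>na_list</code> that contains missing values:</p>

```r
# Hypothetical list with missing values
na_list <- list(a = c(1, NA, 3), b = c(4, 5, NA))

# na.rm = TRUE is passed on to mean() for every element
mean_list <- lapply(na_list, mean, na.rm = TRUE)

print(mean_list)
```

<p>The result is a named list in which <code>a</code> is 2 and <code>b</code> is 4.5.</p>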
</section>
<section id="understanding-map-function" class="level2">
<h2 class="anchored" data-anchor-id="understanding-map-function">Understanding map() Function</h2>
<p>The <strong><code>map()</code></strong> function from the purrr package is similar to <strong><code>lapply()</code></strong>: it always returns a list, but it offers a more consistent interface, including a formula shorthand for anonymous functions. Its syntax is as follows:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(.x, .f, ...)</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>.x</code></strong>: The input data, typically a list.</p></li>
<li><p><strong><code>.f</code></strong>: The function to apply.</p></li>
<li><p><strong><code>...</code></strong>: Additional arguments to be passed to the function.</p></li>
</ul>
<p>Let’s apply an anonymous (lambda) function, written in purrr’s formula shorthand, to each element of a list using <strong><code>map()</code></strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(purrr)</span>
<span id="cb17-2">num_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">a =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">b =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">c =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>)</span>
<span id="cb17-3">mapped_results <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(num_list, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> .x<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb17-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(mapped_results)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>$a
[1] 1 4 9

$b
[1] 16 25 36

$c
[1] 49 64 81</code></pre>
</div>
</div>
<p>In this example, <strong><code>map()</code></strong> applies the anonymous function <code>~ .x^2</code> to each element in the list, squaring every value in that element.</p>
<p>Let’s calculate the lengths of strings in a list using <strong><code>map()</code></strong> and the <strong><code>nchar()</code></strong> function:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb19-1">text_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hello"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"world"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"R"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"programming"</span>)</span>
<span id="cb19-2">string_lengths <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(text_list, nchar)</span>
<span id="cb19-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(string_lengths)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[[1]]
[1] 5

[[2]]
[1] 5

[[3]]
[1] 1

[[4]]
[1] 11</code></pre>
</div>
</div>
<p>Here, <strong><code>map()</code></strong> applies the <strong><code>nchar()</code></strong> function to each element in the list, calculating the length of each string.</p>
</section>
<section id="understanding-map-function-variants" class="level2">
<h2 class="anchored" data-anchor-id="understanding-map-function-variants">Understanding map() Function Variants</h2>
<p>In addition to the <strong><code>map()</code></strong> function, the purrr package provides several variants that are specialized for different types of output: <strong><code>map_lgl()</code></strong>, <strong><code>map_int()</code></strong>, <strong><code>map_dbl()</code></strong>, and <strong><code>map_chr()</code></strong>. These variants are particularly useful when you expect the output to be of a specific data type, such as logical, integer, double, or character.</p>
<ul>
<li><p><strong><code>map_lgl()</code></strong>: This variant is used when the output of the function is expected to be a logical vector.</p></li>
<li><p><strong><code>map_int()</code></strong>: Use this variant when the output of the function is expected to be an integer vector.</p></li>
<li><p><strong><code>map_dbl()</code></strong>: This variant is used when the output of the function is expected to be a double vector.</p></li>
<li><p><strong><code>map_chr()</code></strong>: Use this variant when the output of the function is expected to be a character vector.</p></li>
</ul>
<p>These variants provide stricter type constraints compared to the generic <strong><code>map()</code></strong> function, which can be useful for ensuring the consistency of the output type across iterations. They are particularly handy when working with functions that have predictable output types.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb21-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(purrr)</span>
<span id="cb21-2"></span>
<span id="cb21-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define a list of vectors</span></span>
<span id="cb21-4">num_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">a =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">b =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">c =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>)</span>
<span id="cb21-5"></span>
<span id="cb21-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use map_lgl() to check if all elements in each vector are even</span></span>
<span id="cb21-7">even_check <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_lgl</span>(num_list, <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">all</span>(x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>))</span>
<span id="cb21-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(even_check)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>    a     b     c 
FALSE FALSE FALSE </code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb23-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use map_int() to compute the sum of each vector</span></span>
<span id="cb23-2">vector_sums <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_int</span>(num_list, sum)</span>
<span id="cb23-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(vector_sums)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code> a  b  c 
 6 15 24 </code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb25-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use map_dbl() to compute the mean of each vector</span></span>
<span id="cb25-2">vector_means <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_dbl</span>(num_list, mean)</span>
<span id="cb25-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(vector_means)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>a b c 
2 5 8 </code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb27-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use map_chr() to collapse each vector into a single comma-separated string</span></span>
<span id="cb27-2">vector_strings <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_chr</span>(num_list, toString)</span>
<span id="cb27-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(vector_strings)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>        a         b         c 
"1, 2, 3" "4, 5, 6" "7, 8, 9" </code></pre>
</div>
</div>
<p>By using these specialized variants, you can ensure that the output of your mapping operation adheres to your specific data type requirements, leading to cleaner and more predictable code.</p>
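<p>The stricter constraints also mean that a typed variant fails loudly when the function does not return a single value of the expected type. The following is a small sketch of that behaviour, assuming the same <code>num_list</code> as above; the variable names and the fallback message are made up for illustration. Since <code>range()</code> returns two values per element, <code>map_dbl()</code> cannot build a double vector from it, whereas <code>map()</code> accepts the result without complaint:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(purrr)

num_list &lt;- list(a = 1:3, b = 4:6, c = 7:9)

# range() returns two values, so map_dbl() raises an error;
# tryCatch() turns that error into a readable message
strict_result &lt;- tryCatch(
  map_dbl(num_list, range),
  error = function(e) "map_dbl() failed: each result must be a single value"
)
print(strict_result)

# map() places no constraint on the shape of each result
flexible_result &lt;- map(num_list, range)
print(flexible_result)</code></pre></div>
</div>
<p>Failing early like this is usually preferable to silently receiving a result of an unexpected shape.</p>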
</section>
<section id="performance-comparison" class="level2">
<h2 class="anchored" data-anchor-id="performance-comparison"><strong>Performance Comparison</strong></h2>
<p>To compare the performance of these functions, it’s important to note that the execution time may vary depending on the hardware specifications of your computer, the size of the dataset, and the complexity of the operations performed. While one function may perform better in one scenario, it may not be the case in another. Therefore, it’s recommended to benchmark the functions in your specific use case.</p>
<p>Let’s benchmark computing sums over a 100 x 100 matrix using the different functions:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb29-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(microbenchmark)</span>
<span id="cb29-2"></span>
<span id="cb29-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a 100 x 100 matrix</span></span>
<span id="cb29-4">matrix_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">matrix</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb29-5"></span>
<span id="cb29-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Benchmark the four approaches (only apply() sums column by column here)</span></span>
<span id="cb29-7">benchmark_results <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">microbenchmark</span>(</span>
<span id="cb29-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">apply_sum =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">apply</span>(matrix_data, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, sum),</span>
<span id="cb29-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sapply_sum =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sapply</span>(matrix_data, sum),</span>
<span id="cb29-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lapply_sum =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lapply</span>(matrix_data, sum),</span>
<span id="cb29-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">map_sum =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_dbl</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.list</span>(matrix_data), sum),  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We need to convert the matrix to a list for the map function</span></span>
<span id="cb29-12">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">times =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb29-13">)</span>
<span id="cb29-14"></span>
<span id="cb29-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(benchmark_results)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Unit: microseconds
       expr      min       lq       mean   median        uq       max neval
  apply_sum  200.648  233.577   251.3542  245.394   261.078   537.739   100
 sapply_sum 3698.164 3842.919  4359.8187 3993.574  4212.604  8374.192   100
 lapply_sum 3338.470 3435.519  3997.8134 3611.807  3808.278  7994.968   100
    map_sum 9371.513 9614.495 10584.8131 9904.801 11340.739 20365.188   100</code></pre>
</div>
</div>
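<p>In this run, <strong><code>apply_sum</code></strong> is the fastest alternative by a wide margin, with a median of about 245 microseconds against roughly 3,600 to 9,900 microseconds for the others. The comparison is not like-for-like, however: <strong><code>apply(matrix_data, 2, sum)</code></strong> computes 100 column sums, whereas <strong><code>sapply()</code></strong>, <strong><code>lapply()</code></strong>, and <strong><code>map_dbl()</code></strong> are given the matrix itself (or <strong><code>as.list(matrix_data)</code></strong>) and therefore call <strong><code>sum()</code></strong> once for each of the 10,000 individual elements. The gap mostly reflects that difference in workload rather than the intrinsic speed of each function. When evaluating results like these, it’s also crucial to consider factors beyond processing time, such as usability and functionality, to select the most suitable function for your specific needs.</p>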
<p>Overall, the choice of function depends on factors such as speed, ease of use, and compatibility with the data structure. It’s essential to benchmark different alternatives in your specific use case to determine the most suitable function for your needs.</p>
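<p>For a like-for-like comparison, every function should do the same work. The following is a sketch of one way to set that up, assuming the matrix is first converted to a data frame so that each column becomes one list element. The names <code>col_list</code> and <code>benchmark_cols</code> are introduced here for illustration, and base R’s <code>colSums()</code> is added as a reference point. No timings are shown because they depend on your machine:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">library(purrr)
library(microbenchmark)

set.seed(123)
matrix_data &lt;- matrix(rnorm(10000), nrow = 100)

# A data frame is a list of columns, so each element is one column
col_list &lt;- as.data.frame(matrix_data)

# Every expression now computes the same 100 column sums
benchmark_cols &lt;- microbenchmark(
  apply_sum   = apply(matrix_data, 2, sum),
  sapply_sum  = sapply(col_list, sum),
  lapply_sum  = lapply(col_list, sum),
  map_sum     = map_dbl(col_list, sum),
  colSums_sum = colSums(matrix_data),
  times = 100
)
print(benchmark_cols)

# Sanity check: the approaches agree on the values
col_sums_apply &lt;- apply(matrix_data, 2, sum)
col_sums_map &lt;- map_dbl(col_list, sum)
all.equal(unname(col_sums_apply), unname(col_sums_map))</code></pre></div>
</div>
<p>Checking that the results agree before comparing timings helps confirm that the benchmark measures the same computation in each case.</p>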
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion"><strong>Conclusion</strong></h2>
<p>Apply functions (<strong><code>apply()</code></strong>, <strong><code>sapply()</code></strong>, <strong><code>lapply()</code></strong>) and the <strong><code>map()</code></strong> function from the purrr package are powerful tools for data manipulation and analysis in R. Each function has its unique features and strengths, making them suitable for various tasks.</p>
<ul>
<li><p>The <strong><code>apply()</code></strong> function is versatile and operates on matrices and arrays, allowing for row-wise or column-wise operations. However, its performance may vary depending on the size of the dataset and the nature of the computation.</p></li>
<li><p>The <strong><code>sapply()</code></strong> and <strong><code>lapply()</code></strong> functions are convenient for working with lists and vectors: <strong><code>lapply()</code></strong> always returns a list, while <strong><code>sapply()</code></strong> simplifies the result to a vector or matrix when it can. They offer flexibility and ease of use, making them suitable for a wide range of tasks.</p></li>
<li><p>The <strong><code>map()</code></strong> function offers a more consistent syntax compared to <strong><code>lapply()</code></strong> and provides additional variants (<strong><code>map_lgl()</code></strong>, <strong><code>map_int()</code></strong>, <strong><code>map_dbl()</code></strong>, <strong><code>map_chr()</code></strong>) for handling specific data types. While it may exhibit slower performance in some cases, its functionality and ease of use make it a valuable tool for functional programming in R.</p></li>
</ul>
<p>When choosing the most suitable function for your task, it’s essential to consider factors beyond just performance. Usability, compatibility with data structures, and the nature of the computation should also be taken into account. Additionally, the performance of these functions may vary depending on the hardware specifications of your computer, the size of the dataset, and the complexity of the operations performed. Therefore, it’s recommended to benchmark the functions in your specific use case and evaluate them based on multiple criteria to make an informed decision.</p>
<p>By mastering these functions and understanding their nuances, you can streamline your data analysis workflows and tackle a wide range of analytical tasks with confidence in R.</p>


</section>

 ]]></description>
  <category>R Programming</category>
  <category>apply</category>
  <category>sapply</category>
  <category>lapply</category>
  <category>map</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-04-15_apply_map/</guid>
  <pubDate>Mon, 15 Apr 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>R Function Writing 101: A Journey Through Syntax, Best Practices, and More</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-01-22_functions/</link>
  <description><![CDATA[ 






<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction"><strong>Introduction</strong></h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-01-22_functions/gears.jpg" class="img-fluid quarto-figure quarto-figure-center figure-img"></p>
</figure>
</div>
<p>R is a powerful and versatile programming language widely used in data analysis, statistics, and visualization. One of the key features that make R so flexible is the ability to write your own functions. Functions let you encapsulate a set of instructions in a reusable, modular block of code, which keeps programs organized and efficient. Much like the gears of a well-engineered machine, each function does one job and works together with the others to drive the program as a whole. In this blog post, we will explore the syntax of R functions, best practices for writing them, and a series of hands-on examples.</p>
</section>
<section id="basics-of-writing-functions-in-r" class="level2">
<h2 class="anchored" data-anchor-id="basics-of-writing-functions-in-r"><strong>Basics of Writing Functions in R</strong></h2>
<p><strong>Syntax:</strong></p>
<p>In R, a basic function has the following syntax:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">my_function <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(arg1, arg2, ...) {</span>
<span id="cb1-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body</span></span>
<span id="cb1-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Perform operations using arg1, arg2, ...</span></span>
<span id="cb1-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb1-5">}</span></code></pre></div></div>
</div>
<ul>
<li><p><strong><code>my_function</code></strong>: The name you assign to your function.</p></li>
<li><p><strong><code>arg1, arg2, ...</code></strong>: Arguments passed to the function.</p></li>
<li><p><strong><code>return(result)</code></strong>: The result that the function will produce.</p></li>
</ul>
<p><strong>Example:</strong></p>
<p>Let’s create a simple function that squares a number:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define a function named 'square'</span></span>
<span id="cb2-2">square <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) {</span>
<span id="cb2-3">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> x<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb2-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb2-5">}</span>
<span id="cb2-6"></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage of the function</span></span>
<span id="cb2-8">squared_value <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">square</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb2-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(squared_value)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 16</code></pre>
</div>
</div>
<p>Now, let’s break down the components of this example:</p>
<ol type="1">
<li><p><strong>Function Definition:</strong></p>
<ul>
<li><strong><code>square</code></strong> is the name assigned to the function.</li>
</ul></li>
<li><p><strong>Parameter:</strong></p>
<ul>
<li><strong><code>x</code></strong> is the single parameter or argument that the function expects. It represents the number you want to square.</li>
</ul></li>
<li><p><strong>Function Body:</strong></p>
<ul>
<li>The body of the function is enclosed in curly braces <strong><code>{}</code></strong>. Inside, <strong><code>result &lt;- x^2</code></strong> calculates the square of <strong><code>x</code></strong>.</li>
</ul></li>
<li><p><strong>Return Statement:</strong></p>
<ul>
<li><strong><code>return(result)</code></strong> specifies that the calculated square is the output of the function.</li>
</ul></li>
<li><p><strong>Usage:</strong></p>
<ul>
<li><strong><code>square(4)</code></strong> is an example of calling the function with the value 4. The result is stored in the variable <strong><code>squared_value</code></strong>.</li>
</ul></li>
<li><p><strong>Print Output:</strong></p>
<ul>
<li><strong><code>print(squared_value)</code></strong> prints the result to the console, and the output is <strong><code>16</code></strong>.</li>
</ul></li>
</ol>
<p>This function takes a single argument, squares it, and returns the result. You can customize and use this type of function to perform specific operations on individual values, making your code more modular and readable.</p>
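<p>An explicit <code>return()</code> is optional: an R function returns the value of the last expression it evaluates. As a minimal sketch, the hypothetical <code>square_implicit()</code> below behaves like <code>square()</code> without the intermediate variable, and because arithmetic in R is vectorised it also accepts a whole vector:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Same behaviour as square(), relying on the implicit return value
square_implicit &lt;- function(x) {
  x^2
}

# A single value
single_result &lt;- square_implicit(4)
print(single_result)

# A whole vector is squared element by element
vector_result &lt;- square_implicit(1:3)
print(vector_result)</code></pre></div>
</div>
<p>Whether to write <code>return()</code> explicitly is a style choice; many authors reserve it for early exits and let the last expression serve as the result otherwise.</p>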
</section>
<section id="advanced-function-features" class="level2">
<h2 class="anchored" data-anchor-id="advanced-function-features"><strong>Advanced Function Features</strong></h2>
<section id="default-arguments" class="level3">
<h3 class="anchored" data-anchor-id="default-arguments">Default Arguments</h3>
<p>“Default Arguments” refers to a feature in R functions that allows you to specify default values for function parameters. Default arguments provide a predefined value for a parameter in case the user does not explicitly provide a value when calling the function.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">power_function <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">exponent =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) {</span>
<span id="cb4-2">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span> exponent</span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb4-4">}</span></code></pre></div></div>
</div>
<p>In this example, we define a function called <strong><code>power_function</code></strong> that takes two parameters: <strong><code>x</code></strong> and <strong><code>exponent</code></strong>. Here’s a step-by-step explanation:</p>
<ol type="1">
<li><p><strong>Function Definition:</strong></p>
<ul>
<li><strong><code>power_function</code></strong> is the name of the function.</li>
</ul></li>
<li><p><strong>Parameters:</strong></p>
<ul>
<li><strong><code>x</code></strong> and <strong><code>exponent</code></strong> are the parameters (or arguments) that the function accepts.</li>
</ul></li>
<li><p><strong>Default Value:</strong></p>
<ul>
<li><strong><code>exponent = 2</code></strong> indicates that if the user does not provide a value for <strong><code>exponent</code></strong> when calling the function, it will default to 2.</li>
</ul></li>
<li><p><strong>Function Body:</strong></p>
<ul>
<li>The function body is enclosed in curly braces <strong><code>{}</code></strong> and contains the code that the function will execute.</li>
</ul></li>
<li><p><strong>Calculation:</strong></p>
<ul>
<li>Inside the function body, <strong><code>result &lt;- x ^ exponent</code></strong> calculates the result by raising <strong><code>x</code></strong> to the power of <strong><code>exponent</code></strong>.</li>
</ul></li>
<li><p><strong>Return Statement:</strong></p>
<ul>
<li><strong><code>return(result)</code></strong> specifies that the calculated result will be the output of the function.</li>
</ul></li>
</ol>
<p>Now, let’s see how this function can be used:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage</span></span>
<span id="cb5-2">power_of_3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">power_function</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb5-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(power_of_3) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 9</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">power_of_3_cubed <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">power_function</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb7-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(power_of_3_cubed) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 27</code></pre>
</div>
</div>
<p>Here, we demonstrate two usages of the <strong><code>power_function</code></strong>:</p>
<ol type="1">
<li><p><strong>Without Providing <code>exponent</code>:</strong></p>
<ul>
<li><strong><code>power_function(3)</code></strong> uses the default value of <strong><code>exponent = 2</code></strong>, resulting in <strong><code>3 ^ 2</code></strong>, which is 9.</li>
</ul></li>
<li><p><strong>Providing a Custom <code>exponent</code>:</strong></p>
<ul>
<li><strong><code>power_function(3, 3)</code></strong> explicitly provides a value for <strong><code>exponent</code></strong>, resulting in <strong><code>3 ^ 3</code></strong>, which is 27.</li>
</ul></li>
</ol>
<p>In summary, the default argument (<strong><code>exponent = 2</code></strong>) makes the function more flexible by providing a sensible default value for the <strong><code>exponent</code></strong> parameter, but users can override it by supplying their own value when needed.</p>
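<p>Arguments can also be supplied by name, which makes calls easier to read and frees you from remembering their order. A short sketch, repeating the definition of <code>power_function</code> so that the block runs on its own; <code>named_call</code> and <code>reordered_call</code> are illustrative names:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">power_function &lt;- function(x, exponent = 2) {
  result &lt;- x ^ exponent
  return(result)
}

# Naming the arguments makes the call self-documenting
named_call &lt;- power_function(x = 2, exponent = 3)

# Named arguments can be given in any order
reordered_call &lt;- power_function(exponent = 3, x = 2)

print(named_call)
print(reordered_call)</code></pre></div>
</div>
<p>Both calls compute <code>2 ^ 3</code>, so they return the same value.</p>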
</section>
<section id="variable-arguments" class="level3">
<h3 class="anchored" data-anchor-id="variable-arguments">Variable Arguments</h3>
<p>In R, the <strong><code>...</code></strong> (ellipsis) allows you to work with a variable number of arguments in a function, offering flexibility and convenience. This magical feature empowers you to create functions that can handle different inputs without explicitly defining each one.</p>
<p><strong>Properties of <code>...</code>:</strong></p>
<ul>
<li><p><strong>Variable Number of Arguments:</strong></p>
<ul>
<li><strong><code>...</code></strong> allows you to accept an arbitrary number of arguments in your function.</li>
</ul></li>
<li><p><strong>Passing Arguments to Other Functions:</strong></p>
<ul>
<li>You can pass the ellipsis (<strong><code>...</code></strong>) to other functions within your function, making it extremely versatile.</li>
</ul></li>
</ul>
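<p>The second property, forwarding <code>...</code> to another function, can be sketched as follows. The wrapper name <code>mean_of</code> is hypothetical; it simply hands any extra arguments to base R’s <code>mean()</code>, so callers can use options such as <code>na.rm</code> without the wrapper declaring them:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r">mean_of &lt;- function(x, ...) {
  # Everything captured by ... is passed on to mean() unchanged
  mean(x, ...)
}

values &lt;- c(1, 2, NA, 4)

# Without extra arguments the missing value propagates
mean_with_na &lt;- mean_of(values)
print(mean_with_na)

# na.rm = TRUE travels through ... to mean()
mean_without_na &lt;- mean_of(values, na.rm = TRUE)
print(mean_without_na)</code></pre></div>
</div>
<p>This pattern keeps wrapper functions short while still exposing the full set of options of the function they wrap.</p>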
<p>Let’s break down a code example that collects its arguments with <strong><code>...</code></strong>:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">sum_all <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(...) {</span>
<span id="cb9-2">  numbers <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(...)</span>
<span id="cb9-3">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(numbers)</span>
<span id="cb9-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb9-5">}</span></code></pre></div></div>
</div>
<p>Here’s a step-by-step explanation of the code:</p>
<ol type="1">
<li><p><strong>Function Definition:</strong></p>
<ul>
<li><strong><code>sum_all</code></strong> is the name of the function.</li>
</ul></li>
<li><p><strong>Variable Arguments:</strong></p>
<ul>
<li><strong><code>...</code></strong> is used as a placeholder for a variable number of arguments. It allows the function to accept any number of arguments.</li>
</ul></li>
<li><p><strong>Combining Arguments into a Vector:</strong></p>
<ul>
<li><strong><code>numbers &lt;- c(...)</code></strong> combines all the arguments passed to the function into a vector named <strong><code>numbers</code></strong>.</li>
</ul></li>
<li><p><strong>Summation:</strong></p>
<ul>
<li><strong><code>result &lt;- sum(numbers)</code></strong> calculates the sum of all the numbers in the vector.</li>
</ul></li>
<li><p><strong>Return Statement:</strong></p>
<ul>
<li><strong><code>return(result)</code></strong> specifies that the calculated sum will be the output of the function.</li>
</ul></li>
</ol>
<p>Now, let’s see how this function can be used:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage</span></span>
<span id="cb10-2">total_sum1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum_all</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb10-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(total_sum1)  </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 15</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1">total_sum2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum_all</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>)</span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(total_sum2) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 60</code></pre>
</div>
</div>
<p>In the usage examples:</p>
<ul>
<li><p><strong><code>sum_all(1, 2, 3, 4, 5)</code></strong> passes five arguments to the function, and the sum is calculated as <strong><code>1 + 2 + 3 + 4 + 5</code></strong>, resulting in 15.</p></li>
<li><p><strong><code>sum_all(10, 20, 30)</code></strong> passes three arguments, and the sum is calculated as <strong><code>10 + 20 + 30</code></strong>, resulting in 60.</p></li>
</ul>
<p>This function allows flexibility by accepting any number of arguments, making it suitable for scenarios where the user may need to sum a dynamic set of values. The ellipsis (<strong><code>...</code></strong>) serves as a convenient mechanism for handling variable arguments in R functions.</p>
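<p>One caveat worth sketching: because <code>c(...)</code> collects whatever is passed in, it can help to validate the combined vector and to let the caller decide how missing values are handled. The variant below is a minimal illustration rather than part of the original function; the name <code>sum_all_safe</code> and its <code>na.rm</code> argument are assumptions introduced here.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical variant of sum_all(): validates input and forwards na.rm to sum()
sum_all_safe &lt;- function(..., na.rm = FALSE) {
  numbers &lt;- c(...)
  if (!is.numeric(numbers)) {
    stop("All arguments must be numeric.")
  }
  sum(numbers, na.rm = na.rm)
}

sum_all_safe(1, 2, 3)                    # 6
sum_all_safe(1, 2, NA, 4, na.rm = TRUE)  # 7</code></pre>
</div>
<p>Arguments that come after <code>...</code> in the definition, such as <code>na.rm</code> here, have to be supplied by name when the function is called.</p>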
</section>
<section id="multiple-arguments-in-r-functions" class="level3">
<h3 class="anchored" data-anchor-id="multiple-arguments-in-r-functions">Multiple Arguments in R Functions</h3>
<p>Writing a function with multiple arguments in R means accepting and working with more than one input parameter. Functions can be defined to take several arguments, which allows greater flexibility and customization when calling them with different sets of data.</p>
<p>Here’s a general structure of a function with multiple arguments in R:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1">my_function <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(arg1, arg2, ...) {</span>
<span id="cb14-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body</span></span>
<span id="cb14-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Perform operations using arg1, arg2, ...</span></span>
<span id="cb14-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb14-5">}</span></code></pre></div></div>
</div>
<p>Let’s break down the components:</p>
<ul>
<li><p><strong><code>my_function</code></strong>: The name you assign to your function.</p></li>
<li><p><strong><code>arg1, arg2, ...</code></strong>: Parameters or arguments passed to the function.</p></li>
<li><p><strong><code>...</code></strong>: The ellipsis (<strong><code>...</code></strong>) represents variable arguments, allowing the function to accept a variable number of parameters.</p></li>
</ul>
<p>Here’s a more concrete example:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb15-1">calculate_sum <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x, y) {</span>
<span id="cb15-2">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> y</span>
<span id="cb15-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb15-4">}</span>
<span id="cb15-5"></span>
<span id="cb15-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage</span></span>
<span id="cb15-7">sum_result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">calculate_sum</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb15-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(sum_result) </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 8</code></pre>
</div>
</div>
<p>In this example, the <strong><code>calculate_sum</code></strong> function takes two arguments (<strong><code>x</code></strong> and <strong><code>y</code></strong>) and returns their sum. You can call the function with different values for <strong><code>x</code></strong> and <strong><code>y</code></strong> to obtain different results.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb17" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb17-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage</span></span>
<span id="cb17-2">result1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">calculate_sum</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>)</span>
<span id="cb17-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(result1)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 25</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb19-1">result2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">calculate_sum</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>)</span>
<span id="cb19-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(result2)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 3</code></pre>
</div>
</div>
<p>This flexibility in handling multiple arguments makes R functions versatile and adaptable to various tasks. You can design functions to perform complex operations or calculations by allowing users to input different sets of data through multiple parameters.</p>
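<p>The general structure shown earlier mixes fixed arguments with <code>...</code>. As a minimal sketch of that pattern, the hypothetical helper below (<code>describe_values</code> is a name invented for this illustration) takes two fixed arguments and any number of values. It also shows that fixed arguments can be matched by name, in any order.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: two fixed arguments plus any number of values via ...
describe_values &lt;- function(label, digits, ...) {
  values &lt;- c(...)
  paste0(label, ": ", round(mean(values), digits))
}

# Positional matching: label = "Average", digits = 1, the rest go to ...
describe_values("Average", 1, 2, 4, 9)                       # "Average: 5"

# Named matching: order of label and digits no longer matters
describe_values(digits = 2, label = "Mean", 1.25, 2.5, 3)    # "Mean: 2.25"</code></pre>
</div>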
</section>
<section id="returning-multiple-outputs-from-a-function-in-r" class="level3">
<h3 class="anchored" data-anchor-id="returning-multiple-outputs-from-a-function-in-r">Returning Multiple Outputs from a Function in R</h3>
<p>In R, functions traditionally return a <strong>single object</strong>. However, in many real-world data analysis workflows, we often need a function to return <strong>multiple outputs simultaneously</strong> — such as several statistics, model results, or diagnostic values.</p>
<p>To achieve this, the most common approach in R is to <strong>return a named list</strong>. This provides flexibility, structure, and easy access to individual components.</p>
<p>Below are some practical examples demonstrating this concept.</p>
<section id="example-1-returning-multiple-summary-statistics" class="level4">
<h4 class="anchored" data-anchor-id="example-1-returning-multiple-summary-statistics">Example 1: Returning Multiple Summary Statistics</h4>
<p>Let’s say we want to compute the mean, median, and standard deviation of a numeric vector:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb21-1">summary_stats <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) {</span>
<span id="cb21-2">  mean_x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb21-3">  median_x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">median</span>(x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb21-4">  sd_x <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(x, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb21-5">  </span>
<span id="cb21-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(</span>
<span id="cb21-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> mean_x,</span>
<span id="cb21-8">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">median =</span> median_x,</span>
<span id="cb21-9">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> sd_x</span>
<span id="cb21-10">  ))</span>
<span id="cb21-11">}</span>
<span id="cb21-12"></span>
<span id="cb21-13">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>)</span>
<span id="cb21-14">result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary_stats</span>(data)</span>
<span id="cb21-15"></span>
<span id="cb21-16">result<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>mean    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 30</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 30</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb23-1">result<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>median  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 30</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 30</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb25-1">result<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>sd      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 15.81</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 15.81139</code></pre>
</div>
</div>
<p><strong>What’s happening?</strong></p>
<ul>
<li><p>The function <code>summary_stats()</code> returns a named list with three numeric values.</p></li>
<li><p>You can access each result using <code>$</code>, e.g., <code>result$sd</code>.</p></li>
</ul>
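<p>Besides <code>$</code>, a named list can be inspected in a few other ways. The sketch below uses a small hypothetical function, <code>range_stats()</code>, defined only for this illustration so the snippet runs on its own.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper returning three values in a named list
range_stats &lt;- function(x) {
  list(min = min(x), max = max(x), range = max(x) - min(x))
}

res &lt;- range_stats(c(10, 20, 30, 40, 50))

res[["range"]]  # 40, the same element as res$range
names(res)      # "min" "max" "range"
unlist(res)     # flattens the list into a named numeric vector</code></pre>
</div>
<p>The <code>[[ ]]</code> form is handy when the element name is stored in a variable, while <code>unlist()</code> only makes sense when every element is a single value of the same type.</p>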
</section>
<section id="example-2-returning-a-data-frame-and-plot-together" class="level4">
<h4 class="anchored" data-anchor-id="example-2-returning-a-data-frame-and-plot-together">Example 2: Returning a Data Frame and Plot Together</h4>
<p>Sometimes we want a function to return both <strong>a table</strong> and <strong>a visualization</strong>.</p>
<div class="cell" data-messages="false">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb27-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb27-2"></span>
<span id="cb27-3">analyze_distribution <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) {</span>
<span id="cb27-4">  df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb27-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> x,</span>
<span id="cb27-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">z =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale</span>(x)</span>
<span id="cb27-7">  )</span>
<span id="cb27-8">  </span>
<span id="cb27-9">  plot <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(df, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> value)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb27-10">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">bins =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"steelblue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb27-11">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>()</span>
<span id="cb27-12">  </span>
<span id="cb27-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(</span>
<span id="cb27-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">table =</span> df,</span>
<span id="cb27-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">histogram =</span> plot</span>
<span id="cb27-16">  ))</span>
<span id="cb27-17">}</span>
<span id="cb27-18"></span>
<span id="cb27-19">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)</span>
<span id="cb27-20">output <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">analyze_distribution</span>(data)</span>
<span id="cb27-21"></span>
<span id="cb27-22"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(output<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>table)     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Shows the first few rows of the table</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>       value           z
1  0.3667810  0.50919731
2  0.2490425  0.38116000
3  0.6608920  0.82903484
4 -0.7017313 -0.65277993
5 -0.1806294 -0.08609613
6  0.3228995  0.46147742</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb29-1">output<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>histogram       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Displays the ggplot2 histogram</span></span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-01-22_functions/index_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>Takeaways:</strong></p>
<ul>
<li><p>This function returns both a <code>data.frame</code> and a <code>ggplot</code> object.</p></li>
<li><p>This is especially useful for reporting functions in packages or Shiny applications.</p></li>
</ul>
</section>
<section id="bonus-tip-named-lists-vs.-tibbles" class="level4">
<h4 class="anchored" data-anchor-id="bonus-tip-named-lists-vs.-tibbles">Bonus Tip: Named Lists vs.&nbsp;Tibbles</h4>
<p>While lists are flexible, in some modeling contexts (e.g., when nesting or mapping), it can be useful to wrap outputs in a tibble:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb30" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb30-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tibble)</span>
<span id="cb30-2"></span>
<span id="cb30-3">multi_return <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x) {</span>
<span id="cb30-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(</span>
<span id="cb30-5">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">input =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(x),</span>
<span id="cb30-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">summary =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(x)),</span>
<span id="cb30-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(x)</span>
<span id="cb30-8">  )</span>
<span id="cb30-9">}</span></code></pre></div></div>
</div>
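<p>As a brief usage sketch (assuming the <code>tibble</code> package is installed), calling the function yields a one-row tibble whose list-columns hold the more complex objects:</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">library(tibble)

# Same definition as above, repeated so the snippet runs on its own
multi_return &lt;- function(x) {
  tibble(
    input = list(x),
    summary = list(summary(x)),
    sd = sd(x)
  )
}

out &lt;- multi_return(c(1, 2, 3))

nrow(out)        # 1: one row per call
out$sd           # 1
out$input[[1]]   # the original vector, stored in a list-column</code></pre>
</div>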
<p><strong>In summary:</strong> R does not support multiple return values in the style of Python’s tuple unpacking, but <strong>lists</strong> and <strong>tibbles</strong> allow us to simulate this pattern elegantly. Whether you are building utility functions or modularizing a complex pipeline, returning multiple outputs as a single structured object is both powerful and idiomatic in R.</p>
</section>
</section>
</section>
<section id="more-examples" class="level2">
<h2 class="anchored" data-anchor-id="more-examples"><strong>More Examples</strong></h2>
<section id="mean-of-a-numeric-vector" class="level3">
<h3 class="anchored" data-anchor-id="mean-of-a-numeric-vector">Mean of a Numeric Vector</h3>
<p>Let’s create a simple function that calculates the mean of a numeric vector in R. The function will take a numeric vector as its argument and return the mean value.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb31-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define a function named 'calculate_mean'</span></span>
<span id="cb31-2">calculate_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(numbers) {</span>
<span id="cb31-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check if 'numbers' is numeric</span></span>
<span id="cb31-4">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(numbers)) {</span>
<span id="cb31-5">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stop</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Input must be a numeric vector."</span>)</span>
<span id="cb31-6">  }</span>
<span id="cb31-7"></span>
<span id="cb31-8">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Calculate the mean</span></span>
<span id="cb31-9">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(numbers)</span>
<span id="cb31-10">  </span>
<span id="cb31-11">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Return the mean</span></span>
<span id="cb31-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb31-13">}</span>
<span id="cb31-14"></span>
<span id="cb31-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage of the function</span></span>
<span id="cb31-16">numeric_vector <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb31-17">mean_result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">calculate_mean</span>(numeric_vector)</span>
<span id="cb31-18"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(mean_result)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 6</code></pre>
</div>
</div>
<p>This function also validates its input: <strong><code>if (!is.numeric(numbers))</code></strong> checks whether the input is numeric. If it is not, <strong><code>stop()</code></strong> halts execution and raises an error with the given message.</p>
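<p>To see the validation in action, the error can be triggered deliberately and caught with <code>tryCatch()</code>. This is a minimal sketch; the character vector passed in is an arbitrary example chosen to fail the check.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Same function as above, repeated so the snippet runs on its own
calculate_mean &lt;- function(numbers) {
  if (!is.numeric(numbers)) {
    stop("Input must be a numeric vector.")
  }
  mean(numbers)
}

# tryCatch() intercepts the error and returns its message instead of halting
msg &lt;- tryCatch(
  calculate_mean(c("a", "b")),
  error = function(e) conditionMessage(e)
)
print(msg)  # "Input must be a numeric vector."</code></pre>
</div>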
</section>
<section id="calculate-exponential-growth" class="level3">
<h3 class="anchored" data-anchor-id="calculate-exponential-growth">Calculate Exponential Growth</h3>
<p>Let’s create a function to calculate the exponential growth of a quantity over time. Exponential growth describes a quantity that increases by a fixed percentage in each period, so each period’s increase is applied to the already-grown value.</p>
<p>Here’s an example of how you might write a function in R to calculate exponential growth:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb33-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define a function to calculate exponential growth</span></span>
<span id="cb33-2">calculate_exponential_growth <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(initial_value, growth_rate, time_period) {</span>
<span id="cb33-3">  final_value <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> initial_value <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> growth_rate)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span>time_period</span>
<span id="cb33-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(final_value)</span>
<span id="cb33-5">}</span>
<span id="cb33-6"></span>
<span id="cb33-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage of the function</span></span>
<span id="cb33-8">initial_value <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initial quantity</span></span>
<span id="cb33-9">growth_rate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 5% growth rate</span></span>
<span id="cb33-10">time_period <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 3 years</span></span>
<span id="cb33-11"></span>
<span id="cb33-12">final_result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">calculate_exponential_growth</span>(initial_value, growth_rate, time_period)</span>
<span id="cb33-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(final_result)  </span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 1157.625</code></pre>
</div>
</div>
<p><strong>Explanation:</strong></p>
<ul>
<li><p>The function <strong><code>calculate_exponential_growth</code></strong> takes three parameters: <strong><code>initial_value</code></strong> (the starting quantity), <strong><code>growth_rate</code></strong> (the growth rate per period, expressed as a decimal, e.g.&nbsp;0.05 for 5%), and <strong><code>time_period</code></strong> (the number of periods).</p></li>
<li><p>Inside the function, it calculates the final value after the given time period using the formula for exponential growth:</p></li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?%0AFinal%20Value%20=%20Initial%20Value%5Ctimes%20(1+Growth%20Rate)%5E%7BTimePeriod%7D%20%20%20%20%0A"></p>
<ul>
<li><p>The calculated final value is stored in the variable <strong><code>final_value</code></strong>.</p></li>
<li><p>The function returns the final value.</p></li>
</ul>
<p><strong>In the usage example:</strong></p>
<ul>
<li><p>The initial quantity is set to 1000.</p></li>
<li><p>The growth rate is set to 5% (0.05).</p></li>
<li><p>The time period is set to 3 years.</p></li>
<li><p>The function is called with these values, and the result is printed to the console.</p></li>
</ul>
<p>This is just one example of how you might use a function to calculate exponential growth. Depending on your specific requirements, you can modify the function and parameters to suit different scenarios.</p>
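<p>As a rough sketch of how the same formula extends to a whole growth trajectory, the snippet below relies on R's vectorized arithmetic to compute the value at the end of every period at once. The helper name <strong><code>exponential_growth_path</code></strong> and its <strong><code>n_periods</code></strong> argument are hypothetical and are not part of the function defined above.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper (an assumption for illustration):
# returns the value at the end of each period from 0 to n_periods
exponential_growth_path &lt;- function(initial_value, growth_rate, n_periods) {
  periods &lt;- 0:n_periods
  # The exponent is a vector, so the formula is applied to every period at once
  initial_value * (1 + growth_rate)^periods
}

growth_path &lt;- exponential_growth_path(1000, 0.05, 3)
print(growth_path)</code></pre></div>
</div>
<p>The last element of <strong><code>growth_path</code></strong> corresponds to the single result printed earlier (1157.625), while the earlier elements show the intermediate values for periods 0, 1, and 2.</p>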
</section>
<section id="calculate-compound-interest" class="level3">
<h3 class="anchored" data-anchor-id="calculate-compound-interest">Calculate Compound Interest</h3>
<p>Suppose that we want to create a function to calculate compound interest over time. Compound interest is a financial concept where interest is calculated not only on the initial principal amount but also on the accumulated interest from previous periods. The formula for compound interest is often expressed as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AA=%20P%5Ctimes(1+%5Cfrac%7Br%7D%7Bn%7D)%5E%7Bnt%7D%0A"></p>
<p>where:</p>
<ul>
<li><p><img src="https://latex.codecogs.com/png.latex?A"> is the amount of money accumulated after <img src="https://latex.codecogs.com/png.latex?t"> years, including interest.</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?P"> is the principal amount (initial investment).</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?r"> is the annual interest rate (as a decimal).</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?n"> is the number of times that interest is compounded per year.</p></li>
<li><p><img src="https://latex.codecogs.com/png.latex?t"> is the time the money is invested or borrowed for, in years.</p></li>
</ul>
<p>Here’s an example of how you might write a function in R to calculate compound interest:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb35" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb35-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define a function to calculate compound interest</span></span>
<span id="cb35-2">calculate_compound_interest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(principal, rate, time, compounding_frequency) {</span>
<span id="cb35-3">  amount <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> principal <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> rate<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>compounding_frequency)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span>(compounding_frequency<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>time)</span>
<span id="cb35-4">  interest <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> amount <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> principal</span>
<span id="cb35-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(interest)</span>
<span id="cb35-6">}</span>
<span id="cb35-7"></span>
<span id="cb35-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage of the function</span></span>
<span id="cb35-9">initial_principal <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initial investment</span></span>
<span id="cb35-10">annual_interest_rate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 5% annual interest rate</span></span>
<span id="cb35-11">investment_time <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 3 years</span></span>
<span id="cb35-12">compounding_frequency <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Monthly compounding</span></span>
<span id="cb35-13"></span>
<span id="cb35-14">compound_interest_result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">calculate_compound_interest</span>(initial_principal, annual_interest_rate, investment_time, compounding_frequency)</span>
<span id="cb35-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(compound_interest_result)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] 161.4722</code></pre>
</div>
</div>
<p><strong>Explanation:</strong></p>
<ul>
<li><p>The function <strong><code>calculate_compound_interest</code></strong> takes four parameters: <strong><code>principal</code></strong> (the initial investment), <strong><code>rate</code></strong> (the annual interest rate), <strong><code>time</code></strong> (the time the money is invested for, in years), and <strong><code>compounding_frequency</code></strong> (the number of times interest is compounded per year).</p></li>
<li><p>Inside the function, it calculates the amount using the compound interest formula.</p></li>
<li><p>It then calculates the interest earned by subtracting the initial principal from the final amount.</p></li>
<li><p>The function returns the calculated compound interest.</p></li>
</ul>
<p><strong>In the usage example:</strong></p>
<ul>
<li><p>The initial investment is set to $1000.</p></li>
<li><p>The annual interest rate is set to 5% (0.05).</p></li>
<li><p>The investment time is set to 3 years.</p></li>
<li><p>Interest is compounded monthly (12 times per year).</p></li>
<li><p>The function is called with these values, and the result (compound interest) is printed to the console.</p></li>
</ul>
<p>This example illustrates how you can use a function to calculate compound interest for a given investment scenario. Adjust the parameters based on your specific financial context.</p>
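<p>To see how the compounding frequency affects the result, the following sketch evaluates the same formula for several frequencies in one call. The function is restated under the shorter, hypothetical name <strong><code>compound_interest</code></strong> so the snippet runs on its own, and the set of frequencies (annual, quarterly, monthly, daily) is an assumption chosen for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Same formula as above, restated so the snippet is self-contained
compound_interest &lt;- function(principal, rate, time, compounding_frequency) {
  amount &lt;- principal * (1 + rate / compounding_frequency)^(compounding_frequency * time)
  amount - principal
}

# Assumed frequencies for comparison (times per year)
frequencies &lt;- c(annual = 1, quarterly = 4, monthly = 12, daily = 365)

# The arithmetic is vectorized, so one call covers all frequencies
interest_by_frequency &lt;- compound_interest(1000, 0.05, 3, frequencies)
print(round(interest_by_frequency, 2))</code></pre></div>
</div>
<p>With the principal, rate, and time held fixed, the interest earned increases as compounding becomes more frequent, although the gain from each additional step gets smaller.</p>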
</section>
<section id="custom-plotting-function" class="level3">
<h3 class="anchored" data-anchor-id="custom-plotting-function">Custom Plotting Function</h3>
<p>Let’s enhance the custom plotting function using the ellipsis (<strong><code>...</code></strong>) to allow for additional customization parameters. The ellipsis allows you to pass a variable number of arguments to the function, providing more flexibility.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb37" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb37-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Define a custom plotting function with ellipsis</span></span>
<span id="cb37-2">custom_plot <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(x_values, y_values, ..., <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot_type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"line"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Custom Plot"</span>) {</span>
<span id="cb37-3">  plot_title <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">paste</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Custom Plot:"</span>, title)</span>
<span id="cb37-4">  </span>
<span id="cb37-5">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (plot_type <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"line"</span>) {</span>
<span id="cb37-6">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(x_values, y_values, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"l"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">main =</span> plot_title, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xlab =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"X-axis"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ylab =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Y-axis"</span>, ...)</span>
<span id="cb37-7">  } <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (plot_type <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"scatter"</span>) {</span>
<span id="cb37-8">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(x_values, y_values, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"red"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">main =</span> plot_title, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xlab =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"X-axis"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ylab =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Y-axis"</span>, ...)</span>
<span id="cb37-9">  } <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> {</span>
<span id="cb37-10">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">warning</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invalid plot type. Defaulting to line plot."</span>)</span>
<span id="cb37-11">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(x_values, y_values, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"l"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blue"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">main =</span> plot_title, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xlab =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"X-axis"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ylab =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Y-axis"</span>, ...)</span>
<span id="cb37-12">  }</span>
<span id="cb37-13">}</span>
<span id="cb37-14"></span>
<span id="cb37-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Usage of the custom plotting function with ellipsis</span></span>
<span id="cb37-16">x_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb37-17">y_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb37-18"></span>
<span id="cb37-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a line plot with additional customization (e.g., xlim, ylim)</span></span>
<span id="cb37-20"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">custom_plot</span>(x_data, y_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot_type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"line"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">xlim =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ylim =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Line Plot with Customization"</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-01-22_functions/index_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb38" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb38-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a scatter plot with additional customization (e.g., pch, cex)</span></span>
<span id="cb38-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">custom_plot</span>(x_data, y_data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot_type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"scatter"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pch =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cex =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Scatter Plot with Customization"</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-01-22_functions/index_files/figure-html/unnamed-chunk-16-2.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p><strong>Explanation:</strong></p>
<ul>
<li><p>The <strong><code>...</code></strong> in the function definition allows for additional parameters to be passed to the <strong><code>plot</code></strong> function.</p></li>
<li><p>Inside the function, the <strong><code>plot</code></strong> function is called with the <strong><code>...</code></strong> argument, allowing any additional customization options to be applied to the plot.</p></li>
<li><p>In the usage examples, additional parameters such as <strong><code>xlim</code></strong>, <strong><code>ylim</code></strong>, <strong><code>pch</code></strong>, and <strong><code>cex</code></strong> are passed to customize the appearance of the plots.</p></li>
</ul>
<p>With the ellipsis (<strong><code>...</code></strong>), the custom plotting function becomes more versatile: users can pass additional valid plotting parameters to adjust the appearance of a plot without modifying the function itself.</p>
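<p>One caveat is worth noting: because <strong><code>custom_plot</code></strong> already sets arguments such as <strong><code>col</code></strong> and <strong><code>type</code></strong>, passing the same argument again through <strong><code>...</code></strong> would typically make R stop with a "matched by multiple actual arguments" error. A possible workaround, sketched below, is to collect the defaults in a list, let the user's arguments override them with <strong><code>modifyList()</code></strong>, and call <strong><code>plot()</code></strong> through <strong><code>do.call()</code></strong>. The names <strong><code>merge_plot_args</code></strong> and <strong><code>custom_plot_safe</code></strong> are hypothetical and only illustrate the idea.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: merge default plot settings with user overrides from `...`
merge_plot_args &lt;- function(...) {
  defaults &lt;- list(type = "l", col = "blue", xlab = "X-axis", ylab = "Y-axis")
  # Entries supplied by the user replace defaults with the same name
  modifyList(defaults, list(...))
}

# Hypothetical variant of custom_plot that tolerates overrides such as col = "red"
custom_plot_safe &lt;- function(x_values, y_values, ...) {
  plot_args &lt;- merge_plot_args(...)
  do.call(plot, c(list(x_values, y_values), plot_args))
}

# Inspect the merged arguments without drawing anything
merged_args &lt;- merge_plot_args(col = "darkgreen", lwd = 2)
str(merged_args)</code></pre></div>
</div>
<p>Here the user-supplied <strong><code>col</code></strong> replaces the default colour, <strong><code>lwd</code></strong> is added, and the remaining defaults are kept.</p>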
</section>
</section>
<section id="best-practices-for-writing-functions" class="level2">
<h2 class="anchored" data-anchor-id="best-practices-for-writing-functions"><strong>Best Practices for Writing Functions</strong></h2>
<p>Writing functions in R is a fundamental aspect of creating efficient, readable, and maintainable code. As R enthusiasts, developers, and data scientists, adopting best practices for writing functions is crucial to ensure the quality and usability of our codebase. Whether you’re working on a small script or a large-scale project, following established guidelines can greatly enhance the clarity, modularity, and reliability of your functions.</p>
<p>This section will explore a set of best practices designed to streamline the process of function development in R. From choosing descriptive function names to documenting your code and validating inputs, each practice is geared towards fostering code that is not only functional but also comprehensible to both yourself and others. These practices are aimed at promoting consistency, minimizing errors, and facilitating collaboration by adhering to widely accepted conventions in the R programming community.</p>
<p>Whether you are a novice R user or an experienced developer, integrating these practices into your workflow tends to produce functions that are easier to read, test, and reuse.</p>
<p>Here are some key best practices for writing functions in R:</p>
<ol type="1">
<li><strong>Use Descriptive Function Names:</strong> Choose clear and descriptive names for your functions that convey their purpose. This makes the code more understandable.</li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb39" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb39-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Good example</span></span>
<span id="cb39-2">calculate_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(data) {</span>
<span id="cb39-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body</span></span>
<span id="cb39-4">}</span>
<span id="cb39-5"></span>
<span id="cb39-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Avoid</span></span>
<span id="cb39-7">fn <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(d) {</span>
<span id="cb39-8">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body</span></span>
<span id="cb39-9">}</span></code></pre></div></div>
</div>
<ol start="2" type="1">
<li><strong>Document Your Functions:</strong> Add documentation comments directly above your function (roxygen2-style lines starting with <strong><code>#'</code></strong>) to explain its purpose, input parameters, and expected output. This helps other users (or yourself) understand how to use the function.</li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb40" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb40-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Good example</span></span>
<span id="cb40-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Calculate the mean of a numeric vector.</span></span>
<span id="cb40-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb40-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param data Numeric vector for which mean is calculated.</span></span>
<span id="cb40-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @return Mean value.</span></span>
<span id="cb40-6">calculate_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(data) {</span>
<span id="cb40-7">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body</span></span>
<span id="cb40-8">}</span></code></pre></div></div>
</div>
<ol start="3" type="1">
<li><strong>Validate Inputs:</strong> Check the validity of input parameters within your function. Ensure that the inputs meet the expected format and constraints.</li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb41" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb41-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Good example</span></span>
<span id="cb41-2">calculate_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(data) {</span>
<span id="cb41-3">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.numeric</span>(data)) {</span>
<span id="cb41-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">stop</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Input must be a numeric vector."</span>)</span>
<span id="cb41-5">  }</span>
<span id="cb41-6">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body</span></span>
<span id="cb41-7">}</span></code></pre></div></div>
</div>
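<p>To illustrate what the check does at run time, here is a minimal sketch. It assumes the function body simply calls <strong><code>mean()</code></strong>, which is a placeholder for whatever the real body would contain, and it uses <strong><code>tryCatch()</code></strong> to capture the error message instead of halting the script.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">calculate_mean &lt;- function(data) {
  if (!is.numeric(data)) {
    stop("Input must be a numeric vector.")
  }
  mean(data)  # assumed body, for illustration only
}

# Valid input passes the check and returns the mean
valid_result &lt;- calculate_mean(c(2, 4, 6))

# Invalid input triggers stop(); tryCatch() returns the error message instead
error_message &lt;- tryCatch(
  calculate_mean(c("a", "b")),
  error = function(e) conditionMessage(e)
)

print(valid_result)
print(error_message)</code></pre></div>
</div>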
<ol start="4" type="1">
<li><strong>Avoid Global Variables:</strong> Minimize the use of global variables within your functions. Instead, pass required parameters as arguments to make functions more modular and reusable.</li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb42" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb42-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Good example</span></span>
<span id="cb42-2">calculate_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(data) {</span>
<span id="cb42-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body using 'data'</span></span>
<span id="cb42-4">}</span></code></pre></div></div>
</div>
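<p>The contrast may be easier to see with a small sketch. The function names and the <strong><code>threshold</code></strong> variable below are hypothetical: the first version silently depends on a variable defined outside the function, while the second receives everything it needs as arguments.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">threshold &lt;- 10  # global variable

# Avoid: the result changes whenever the global `threshold` changes
count_above_global &lt;- function(x) {
  sum(x &gt; threshold)
}

# Prefer: the threshold is passed explicitly as an argument
count_above &lt;- function(x, threshold) {
  sum(x &gt; threshold)
}

values &lt;- c(5, 12, 20)
n_above &lt;- count_above(values, threshold = 10)
print(n_above)</code></pre></div>
</div>
<p>The second version can be reused and tested with different thresholds without touching any variable outside the function.</p>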
<ol start="5" type="1">
<li><strong>Separate Concerns:</strong> Divide your code into modular and focused functions, each addressing a specific concern. This promotes reusability and makes your code more maintainable.</li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb43" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb43-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Good example</span></span>
<span id="cb43-2">calculate_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(data) {</span>
<span id="cb43-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body</span></span>
<span id="cb43-4">}</span>
<span id="cb43-5"></span>
<span id="cb43-6">plot_histogram <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(data) {</span>
<span id="cb43-7">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function body</span></span>
<span id="cb43-8">}</span></code></pre></div></div>
</div>
<ol start="6" type="1">
<li><strong>Avoid Global Side Effects:</strong> Minimize changes to global variables within your functions. Functions should ideally return results rather than modify global state.</li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb44" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb44-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Good example</span></span>
<span id="cb44-2">calculate_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(data) {</span>
<span id="cb44-3">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(data)</span>
<span id="cb44-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb44-5">}</span></code></pre></div></div>
</div>
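<p>As a hedged illustration of what a global side effect looks like, the sketch below contrasts a function that modifies an outside variable with the super-assignment operator <strong><code>&lt;&lt;-</code></strong> against one that returns the new value. All names here are hypothetical.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Avoid: changes a variable outside the function via &lt;&lt;-
running_total &lt;- 0
add_to_total_global &lt;- function(x) {
  running_total &lt;&lt;- running_total + x
}

# Prefer: take the current state as input and return the new state
add_to_total &lt;- function(total, x) {
  total + x
}

total &lt;- 0
total &lt;- add_to_total(total, 5)
total &lt;- add_to_total(total, 7)
print(total)</code></pre></div>
</div>
<p>With the second version, the caller decides where the result is stored, so the function's effect is visible at the call site.</p>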
<ol start="7" type="1">
<li><strong>Use Default Argument Values:</strong> Set default values for function arguments when it makes sense. This improves the usability of your functions by allowing users to omit optional arguments.</li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb45" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb45-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Good example</span></span>
<span id="cb45-2">calculate_mean <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) {</span>
<span id="cb45-3">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(data, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> na.rm)</span>
<span id="cb45-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(result)</span>
<span id="cb45-5">}</span></code></pre></div></div>
</div>
<ol start="8" type="1">
<li><strong>Test Your Functions:</strong> Develop test cases to ensure that your functions behave as expected. Testing helps catch bugs early and provides confidence in the reliability of your code.</li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb46" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb46-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Good example (load the testthat package first with library(testthat))</span></span>
<span id="cb46-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">test_that</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"calculate_mean returns the correct result"</span>, {</span>
<span id="cb46-3">  data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb46-4">  result <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">calculate_mean</span>(data)</span>
<span id="cb46-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">expect_equal</span>(result, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb46-6">})</span></code></pre></div></div>
</div>
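<p>Tests are most useful when they also cover edge cases. As a minimal sketch, assuming the <code>calculate_mean()</code> function with the <code>na.rm</code> argument from the previous tip and an installed <code>testthat</code> package, the following checks how missing values are handled:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(testthat)

# Same function as in the default-argument example above
calculate_mean &lt;- function(data, na.rm = FALSE) {
  result &lt;- mean(data, na.rm = na.rm)
  return(result)
}

test_that("calculate_mean handles missing values", {
  data &lt;- c(1, NA, 3)
  # With the default na.rm = FALSE, the missing value propagates to the result
  expect_true(is.na(calculate_mean(data)))
  # With na.rm = TRUE, the NA is dropped before averaging
  expect_equal(calculate_mean(data, na.rm = TRUE), 2)
})</code></pre></div>
</div>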
<p>By following these best practices, you can create functions that are more robust, understandable, and adaptable, contributing to the overall quality of your R code.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion"><strong>Conclusion</strong></h2>
<p>Mastering the art of writing functions in R is essential for efficient and organized programming. Whether you’re performing simple calculations or tackling complex problems, functions empower you to write cleaner, more maintainable code. By following best practices and exploring diverse examples, you can elevate your R programming skills and unleash the full potential of this versatile language.</p>
<p>As we reach the conclusion of our exploration, take a moment to appreciate the symphony of gears turning—a reflection of the interconnected brilliance of functions in R. From simple calculations to complex algorithms, each function plays a vital role in the harmony of your code.</p>
<p>Armed with a deeper understanding of syntax, best practices, and real-world examples, you now possess the tools to craft efficient and organized functions. Like a well-tuned machine, let your code operate smoothly, with each function contributing to the overall success of your programming endeavors.</p>
<p>Happy coding, and may your gears always turn with precision! 🚀⚙️</p>


</section>

 ]]></description>
  <category>R Programming</category>
  <category>Functions</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-01-22_functions/</guid>
  <pubDate>Tue, 23 Jan 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Cracking the Code of Categorical Data: A Guide to Factors in R</title>
  <dc:creator>M. Fatih Tüzen</dc:creator>
  <link>https://mfatihtuzen.github.io/posts/2024-01-11_factors/</link>
  <description><![CDATA[ 






<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction"><strong>Introduction</strong></h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="https://allisonhorst.com/everything-else"><img src="https://mfatihtuzen.github.io/posts/2024-01-11_factors/nominal_ordinal_binary.png" class="img-fluid quarto-figure quarto-figure-center figure-img"></a></p>
</figure>
</div>
<figcaption>https://allisonhorst.com/everything-else</figcaption>
</figure>
</div>
<p>R is a versatile language known for its powerful statistical and data manipulation capabilities. One often-overlooked feature that plays a crucial role in organizing and analyzing data is the factor. In this blog post, we’ll look at what factors are, why they matter, and how to use them effectively in R.</p>
</section>
<section id="creation-of-factors" class="level2">
<h2 class="anchored" data-anchor-id="creation-of-factors"><strong>Creation of Factors</strong></h2>
<p>Creating factors in R involves converting categorical data into a specific data type that represents distinct levels. The most common method involves using the <strong><code>factor()</code></strong> function.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a factor from a character vector</span></span>
<span id="cb1-2">gender_vector <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Male"</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>),<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Female"</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>))</span>
<span id="cb1-3">gender_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(gender_vector)</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Displaying the factor</span></span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(gender_factor)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code> [1] Male   Male   Male   Male   Male   Female Female Female Female Female
[11] Female Female
Levels: Female Male</code></pre>
</div>
</div>
<p>You can explicitly specify the levels when creating a factor.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a factor with specified levels</span></span>
<span id="cb3-2">education_vector <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High School"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bachelor's"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Master's"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PhD"</span>)</span>
<span id="cb3-3">education_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(education_vector, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">levels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High School"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Bachelor's"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Master's"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"PhD"</span>))</span>
<span id="cb3-4"></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Displaying the factor</span></span>
<span id="cb3-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(education_factor)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] High School Bachelor's  Master's    PhD        
Levels: High School Bachelor's Master's PhD</code></pre>
</div>
</div>
<p>For ordinal data, factors can be ordered.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating an ordered factor</span></span>
<span id="cb5-2">rating_vector <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span>  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Low"</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>),<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Medium"</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>),<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rep</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High"</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))</span>
<span id="cb5-3">rating_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(rating_vector, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ordered =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">levels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Medium"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High"</span>))</span>
<span id="cb5-4"></span>
<span id="cb5-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Displaying the ordered factor</span></span>
<span id="cb5-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(rating_factor)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code> [1] Low    Low    Low    Low    Medium Medium Medium Medium Medium High  
[11] High  
Levels: Low &lt; Medium &lt; High</code></pre>
</div>
</div>
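<p>One practical consequence of ordering is that comparison operators become meaningful. The short sketch below rebuilds the same <code>rating_factor</code> and uses it in comparisons; it is an illustration of the idea rather than a complete treatment:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Rebuild the ordered factor from the example above
rating_vector &lt;- c(rep("Low", 4), rep("Medium", 5), rep("High", 2))
rating_factor &lt;- factor(rating_vector, ordered = TRUE,
                        levels = c("Low", "Medium", "High"))

# Comparisons follow the level order Low &lt; Medium &lt; High
below_high &lt;- rating_factor &lt; "High"
sum(below_high)       # number of ratings ranked below "High"

# max() and min() also respect the level order
max(rating_factor)
min(rating_factor)</code></pre></div>
</div>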
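<p>You can change the order of the levels by listing them in a different sequence in <code>levels</code>; <code>ordered=TRUE</code> marks the factor as ordered. The data values stay the same, and only the ranking of the levels changes.</p>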
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">rating_vector_2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(rating_vector,</span>
<span id="cb7-2">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">levels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Medium"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Low"</span>), </span>
<span id="cb7-3">                          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ordered =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb7-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(rating_vector_2)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code> [1] Low    Low    Low    Low    Medium Medium Medium Medium Medium High  
[11] High  
Levels: High &lt; Medium &lt; Low</code></pre>
</div>
</div>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Tip
</div>
</div>
<div class="callout-body-container callout-body">
<p>You can also use the <strong><code>gl()</code></strong> function to generate factors by specifying the pattern of their levels.</p>
<pre><code>Syntax:
gl(n, k, length = n * k, labels = seq_len(n), ordered = FALSE)

Parameters:
n: Number of levels
k: Number of consecutive replications of each level
length: Length of the result (optional; defaults to n * k)
labels: Labels for the levels (optional)
ordered: Logical value indicating whether the result is an ordered factor</code></pre>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1">new_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gl</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, </span>
<span id="cb10-2">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">k =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, </span>
<span id="cb10-3">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"level1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"level2"</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"level3"</span>),</span>
<span id="cb10-4">                 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ordered =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb10-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">print</span>(new_factor)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code> [1] level1 level1 level1 level1 level2 level2 level2 level2 level3 level3
[11] level3 level3
Levels: level1 &lt; level2 &lt; level3</code></pre>
</div>
</div>
</div>
</div>
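<p>The <code>length</code> argument of <code>gl()</code> recycles the level pattern until the result has the requested length. As a small sketch with hypothetical labels (<code>"Control"</code> and <code>"Treatment"</code> are made up for illustration), this alternates two groups across six observations:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Two levels, each repeated once (k = 1), recycled to a total length of 6
treatment &lt;- gl(n = 2, k = 1, length = 6,
                labels = c("Control", "Treatment"))
print(treatment)

# Each level appears the same number of times
table(treatment)</code></pre></div>
</div>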
</section>
<section id="understanding-factors" class="level2">
<h2 class="anchored" data-anchor-id="understanding-factors">Understanding Factors</h2>
<p>In R, a factor is a data type used to categorize and store data. Essentially, it represents a categorical variable and is particularly useful when dealing with variables that have a fixed number of unique values. Factors can be thought of as a way to represent and work with categorical data efficiently.</p>
<p>Factors in R programming are not merely a data type; they are a powerful tool for elevating the efficiency and interpretability of your code. Whether you are analyzing survey responses, evaluating educational levels, or visualizing temperature categories, factors bring a level of organization and clarity that is indispensable in the data analysis landscape. By embracing factors, you unlock a sophisticated approach to handling categorical data, enabling you to extract deeper insights from your datasets and empowering your R code with a robust foundation for statistical analyses.</p>
<p>Factors are used in many scenarios: handling categorical data, statistical modeling, saving memory, maintaining data integrity, creating visualizations, and simplifying data manipulation tasks in R.</p>
<section id="categorical-data-representation" class="level3">
<h3 class="anchored" data-anchor-id="categorical-data-representation"><strong>Categorical Data Representation</strong></h3>
<p>Factors allow you to efficiently represent categorical data in R. Categorical variables, such as gender, education level, or geographic region, are common in many datasets. Factors provide a structured way to handle and analyze these categories. Converting such a variable into a factor not only groups its values into levels but also standardizes their representation across the dataset, allowing for consistent analysis.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Sample data as a vector</span></span>
<span id="cb12-2">gender <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Male"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Female"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Male"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Male"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Female"</span>)</span>
<span id="cb12-3"></span>
<span id="cb12-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Converting to factor</span></span>
<span id="cb12-5">gender_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(gender)</span>
<span id="cb12-6"></span>
<span id="cb12-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Checking levels</span></span>
<span id="cb12-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">levels</span>(gender_factor)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] "Female" "Male"  </code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Checking unique values within the factor</span></span>
<span id="cb14-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unique</span>(gender_factor)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>[1] Male   Female
Levels: Female Male</code></pre>
</div>
</div>
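<p>Once a variable is a factor, counting observations per category is straightforward. The sketch below reuses the same <code>gender</code> vector and relies only on base R functions:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Same sample data as above
gender &lt;- c("Male", "Female", "Male", "Male", "Female")
gender_factor &lt;- factor(gender)

# Number of distinct categories
nlevels(gender_factor)

# Frequency of each level, in level order (Female, Male)
gender_counts &lt;- table(gender_factor)
print(gender_counts)

# summary() on a factor returns the same counts as a named integer vector
summary(gender_factor)</code></pre></div>
</div>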
</section>
<section id="statistical-analysis-and-modeling" class="level3">
<h3 class="anchored" data-anchor-id="statistical-analysis-and-modeling"><strong>Statistical Analysis and Modeling</strong></h3>
<p>Statistical models often require categorical variables to be converted into factors. When performing regression analysis or any statistical modeling in R, factors ensure that categorical variables are correctly interpreted, allowing models to account for categorical variations in the data.</p>
<p>Let’s look at an example with two factor variables to see the role they play in a statistical model. We’ll consider the scenario of exploring the impact of both income levels and education levels on spending behavior.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Simulated data for spending behavior</span></span>
<span id="cb16-2">n <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb16-3">spending <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">runif</span>(n, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">min =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">max =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>)</span>
<span id="cb16-4"></span>
<span id="cb16-5">income_levels <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Medium"</span>), </span>
<span id="cb16-6">                        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> n, </span>
<span id="cb16-7">                        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb16-8">education_levels <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High School"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Graduate"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Undergraduate"</span>), </span>
<span id="cb16-9">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> n, </span>
<span id="cb16-10">                           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb16-11"></span>
<span id="cb16-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating factor variables for income and education</span></span>
<span id="cb16-13">income_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(income_levels)</span>
<span id="cb16-14">education_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(education_levels)</span>
<span id="cb16-15"></span>
<span id="cb16-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Linear model with both income and education as factor variables</span></span>
<span id="cb16-17">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lm</span>(spending <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> income_factor <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> education_factor)</span>
<span id="cb16-18"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summary</span>(model)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>
Call:
lm(formula = spending ~ income_factor + education_factor)

Residuals:
    Min      1Q  Median      3Q     Max 
-267.22 -114.00   20.13  103.80  234.03 

Coefficients:
                              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)                     318.94      32.69   9.756 5.48e-16 ***
income_factorLow                 11.26      35.17   0.320    0.750    
income_factorMedium              39.44      35.18   1.121    0.265    
education_factorHigh School     -12.23      34.95  -0.350    0.727    
education_factorUndergraduate    51.43      33.21   1.549    0.125    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 138.9 on 95 degrees of freedom
Multiple R-squared:  0.05486,   Adjusted R-squared:  0.01507 
F-statistic: 1.379 on 4 and 95 DF,  p-value: 0.2472</code></pre>
</div>
</div>
<p>The model summary reports how income and education levels relate to spending:</p>
<ul>
<li><p><strong>Coefficients:</strong> R uses the first level of each factor as the reference category (here <code>High</code> for <strong><code>income_factor</code></strong> and <code>Graduate</code> for <strong><code>education_factor</code></strong>, because levels are sorted alphabetically by default). Every other level gets its own coefficient, which estimates the difference in mean spending relative to that reference while the other factor is held fixed.</p></li>
<li><p><strong>Interactions:</strong> If there were an interaction term (which we don’t have in this simplified example), it would represent the combined effect of both factors on the response variable.</p></li>
</ul>
<p>Because the spending values here are simulated independently of income and education, none of the level coefficients is statistically significant, and the exact numbers change from run to run unless a seed is set. With real data, the same model structure shows how the levels of several categorical variables relate to a continuous response variable.</p>
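<p>The coefficients of a factor are always read against its first level, so it is often worth choosing that level deliberately. The following is a minimal sketch using base R’s <code>relevel()</code>; the seed value is an arbitrary assumption added only to make the simulated data reproducible, and the <code>_releveled</code> names are hypothetical:</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">set.seed(123)  # arbitrary seed, only for reproducibility

# Simulated data in the same spirit as the example above
n &lt;- 100
spending &lt;- runif(n, min = 100, max = 600)
income_levels &lt;- sample(c("Low", "High", "Medium"), size = n, replace = TRUE)
income_factor &lt;- factor(income_levels)

# Default level order is alphabetical, so "High" is the reference
levels(income_factor)

# Make "Low" the reference category instead
income_releveled &lt;- relevel(income_factor, ref = "Low")
levels(income_releveled)

# Coefficients are now differences relative to "Low"
model_releveled &lt;- lm(spending ~ income_releveled)
names(coef(model_releveled))</code></pre></div>
</div>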
</section>
<section id="efficiency-in-memory-and-performance" class="level3">
<h3 class="anchored" data-anchor-id="efficiency-in-memory-and-performance"><strong>Efficiency in Memory and Performance</strong></h3>
<p>Factors in R are implemented as integers that point to a levels attribute, which contains unique values within the categorical variable. This representation can save memory compared to storing string labels for each observation. It also speeds up some operations as integers are more efficiently handled in computations.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a large dataset with a categorical variable</span></span>
<span id="cb18-2">large_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"A"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"B"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"C"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"D"</span>), <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">^</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb18-3"></span>
<span id="cb18-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Memory usage comparison</span></span>
<span id="cb18-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">object.size</span>(large_data) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Memory usage without factor</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>8000272 bytes</code></pre>
</div>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb20-1">large_data_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(large_data)</span>
<span id="cb20-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">object.size</span>(large_data_factor) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Memory usage with factor</span></span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>4000688 bytes</code></pre>
</div>
</div>
<p>In this example:</p>
<ol type="1">
<li><p>We generate a large character vector (<strong><code>large_data</code></strong>) of one million values drawn from four categories.</p></li>
<li><p>We compare the memory usage between the original character vector and the factor representation.</p></li>
</ol>
<p>When you run the code, you’ll see that the factor needs about half the memory of the character vector (roughly 4 MB versus 8 MB here). A factor stores each value as an integer code and keeps the distinct labels only once, as its levels.</p>
<p>This compact integer representation saves memory and can also speed up operations such as grouping and tabulation, which work on the integer codes rather than on strings. This is particularly useful with large datasets or limited resources.</p>
<p>Efficient memory usage matters most when datasets are large, as in big data analytics or machine learning tasks. Storing categorical variables as factors is one straightforward way to reduce the memory footprint of such data.</p>
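<p>As a minimal sketch of how a factor is stored internally, the snippet below inspects a small factor. The vector name <strong><code>sizes</code></strong> is an illustrative assumption and is not used elsewhere in this post.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical example vector (not part of the data above)
sizes &lt;- c("S", "M", "L", "M", "S")
sizes_factor &lt;- factor(sizes)

typeof(sizes_factor)      # "integer": values are stored as integer codes
levels(sizes_factor)      # "L" "M" "S": each distinct label is stored once
as.integer(sizes_factor)  # 3 2 1 2 3: position of each value in levels()</code></pre></div>
</div>
<p>On a 64-bit build of R, each element of a character vector is an 8-byte pointer to a string, while each integer code takes 4 bytes, which is consistent with the roughly two-to-one ratio in the output above.</p>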
</section>
<section id="data-integrity-and-consistency" class="level3">
<h3 class="anchored" data-anchor-id="data-integrity-and-consistency"><strong>Data Integrity and Consistency</strong></h3>
<p>Factors enforce the integrity of categorical data. They ensure that only predefined levels are used within a variable, preventing the introduction of new, unforeseen categories. This maintains consistency and prevents errors in analysis or modeling caused by unexpected categories.</p>
<p>One of the key features of factors is their ability to explicitly define and enforce levels within a categorical variable. This ensures that the data conforms to a consistent set of categories, providing a robust framework for analysis.</p>
<p>Consider a scenario where we have a factor representing temperature categories: ‘Low’, ‘Medium’, and ‘High’. Let’s explore how factors help maintain consistency:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb22-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a factor with specified levels</span></span>
<span id="cb22-2">temperature <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Medium"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Extreme"</span>)</span>
<span id="cb22-3"></span>
<span id="cb22-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Defining specific levels</span></span>
<span id="cb22-5">temperature_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(temperature, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">levels =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Low"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Medium"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"High"</span>))</span>
<span id="cb22-6"></span>
<span id="cb22-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Replacing with an undefined level will generate a warning</span></span>
<span id="cb22-8">temperature_factor[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Extreme High"</span></span></code></pre></div></div>
<div class="cell-output cell-output-stderr">
<pre><code>Warning in `[&lt;-.factor`(`*tmp*`, 5, value = "Extreme High"): invalid factor
level, NA generated</code></pre>
</div>
</div>
<p>In this example:</p>
<ol type="1">
<li><p>We create a character vector of temperature categories that includes one unexpected value (‘Extreme’).</p></li>
<li><p>We convert it to a factor and explicitly define the allowed levels using the <strong><code>levels</code></strong> argument.</p></li>
<li><p>An attempt to assign a new, undefined level (‘Extreme High’) generates a warning and stores <strong><code>NA</code></strong>.</p></li>
</ol>
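<p>When you run the code, you’ll observe that assigning a value outside the defined levels triggers a warning and stores <strong><code>NA</code></strong> instead. Note that <strong><code>factor()</code></strong> itself is quieter: the original ‘Extreme’ entry, which is not among the specified levels, is converted to <strong><code>NA</code></strong> without any warning. Factors therefore keep undefined categories out of the data, but it is still worth checking for unexpected <strong><code>NA</code></strong> values after conversion.</p>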
<p>In real-world scenarios, maintaining data integrity is crucial for accurate analyses and meaningful interpretations. Factors provide a safeguard against inadvertent errors, ensuring that the categorical data remains consistent throughout the analysis process. This is particularly important in collaborative projects or situations where data is sourced from multiple channels.</p>
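<p>If a new category is genuinely needed, one option is to extend the levels before assigning the value. The following is a sketch of that approach; the names <strong><code>raw</code></strong> and <strong><code>temps</code></strong> are hypothetical stand-ins, so the example does not modify the objects created earlier.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Values outside `levels` become NA at creation time, without a warning
raw &lt;- c("Low", "Medium", "High", "Extreme")
temps &lt;- factor(raw, levels = c("Low", "Medium", "High"))
is.na(temps)   # FALSE FALSE FALSE TRUE

# Add the new level explicitly, keeping the existing ones
levels(temps) &lt;- c(levels(temps), "Extreme")

# The assignment now succeeds without a warning
temps[4] &lt;- "Extreme"

levels(temps)  # "Low" "Medium" "High" "Extreme"
anyNA(temps)   # FALSE</code></pre></div>
</div>
<p>Extending the levels deliberately keeps the set of valid categories explicit, so that a typo still surfaces as a warning rather than as a silent new category.</p>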
</section>
<section id="graphical-representations-and-visualizations" class="level3">
<h3 class="anchored" data-anchor-id="graphical-representations-and-visualizations"><strong>Graphical Representations and Visualizations</strong></h3>
<p>Factors in R contribute significantly to the creation of clear and insightful visualizations. By ensuring proper ordering and labeling of categorical data, factors play a pivotal role in generating meaningful graphs and charts that enhance data interpretation.</p>
<p>When creating visual representations of data, such as bar plots or pie charts, factors provide a structured foundation. They ensure that the categories are appropriately arranged and labeled, allowing for accurate communication of insights.</p>
<p>Let’s create a simple bar plot using the <strong><code>ggplot2</code></strong> library, showcasing the distribution of product categories:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb24-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Sample data: product categories</span></span>
<span id="cb24-2"></span>
<span id="cb24-3">categories <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Electronics"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Clothing"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Food"</span>),</span>
<span id="cb24-4">                     <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20</span>,</span>
<span id="cb24-5">                     <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>)</span>
<span id="cb24-6">category_factor <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">factor</span>(categories)</span>
<span id="cb24-7"></span>
<span id="cb24-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a bar plot with factors using ggplot2</span></span>
<span id="cb24-9"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(ggplot2)</span>
<span id="cb24-10"></span>
<span id="cb24-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a data frame for ggplot</span></span>
<span id="cb24-12">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">category =</span> category_factor)</span>
<span id="cb24-13"></span>
<span id="cb24-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Creating a bar plot</span></span>
<span id="cb24-15"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(data, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> category, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> category)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb24-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_bar</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb24-17">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Distribution of Product Categories"</span>, </span>
<span id="cb24-18">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Category"</span>, </span>
<span id="cb24-19">       <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Count"</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://mfatihtuzen.github.io/posts/2024-01-11_factors/index_files/figure-html/unnamed-chunk-10-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>In this example:</p>
<ol type="1">
<li><p>We have a sample dataset representing different product categories.</p></li>
<li><p>The variable <strong><code>category_factor</code></strong> is a factor representing these categories.</p></li>
<li><p>We use <strong><code>ggplot2</code></strong> to create a bar plot, mapping the factor levels to the x-axis and fill color.</p></li>
</ol>
<p>When you run the code, you’ll generate a bar plot of the distribution of product categories. Because no levels were specified, <strong><code>factor()</code></strong> sorts them alphabetically, so the bars appear in that order (Clothing, Electronics, Food). Pass an explicit <strong><code>levels</code></strong> argument to choose a different order.</p>
<p>In data analysis, effective visualization is often the key to conveying insights to stakeholders. By leveraging factors in graphical representations, R users enhance the clarity and interpretability of their visualizations. This is particularly valuable when dealing with categorical data, where the correct representation of levels is essential for accurate communication.</p>
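<p>As a sketch of how the bar order can be controlled, the example below compares the default levels with an explicit <strong><code>levels</code></strong> argument. The names <strong><code>cats</code></strong> and <strong><code>cats_ordered</code></strong>, and the ordering chosen, are illustrative assumptions.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"># Hypothetical example data
cats &lt;- c("Food", "Electronics", "Clothing", "Food")

# Default: levels are sorted alphabetically
levels(factor(cats))
# "Clothing" "Electronics" "Food"

# Explicit levels define a custom order
cats_ordered &lt;- factor(cats, levels = c("Food", "Clothing", "Electronics"))
levels(cats_ordered)
# "Food" "Clothing" "Electronics"

# Summaries follow the level order: Food 2, Clothing 1, Electronics 1
table(cats_ordered)</code></pre></div>
</div>
<p>Mapping a factor like <strong><code>cats_ordered</code></strong> to the <strong><code>x</code></strong> aesthetic in a <strong><code>ggplot2</code></strong> bar plot draws the bars in the order of its levels.</p>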
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>In the intricate world of data analysis, where insights hide within categorical nuances, factors in R emerge as indispensable guides, offering a pathway to crack the code of categorical data. Through the exploration of their multifaceted roles, we’ve uncovered how factors bring structure, efficiency, and integrity to the table.</p>
<p>Factors, as revealed in our journey, stand as the bedrock for efficient data representation and manipulation. They unlock the power of statistical modeling, enabling us to dissect the impact of categorical variables on outcomes with precision. Memory efficiency becomes a notable ally, especially in the face of colossal datasets, where factors shine by optimizing computational performance.</p>
<p>Maintaining data integrity is a critical aspect of any analytical endeavor, and factors act as vigilant guardians, ensuring that categorical variables adhere to predefined levels. This post showed that assigning a value outside the defined levels produces a warning and an <strong><code>NA</code></strong> rather than silently creating a new category.</p>
<p>The journey through the visualization realm illustrated that factors are not just behind-the-scenes players; they are conductors orchestrating visually compelling narratives. By ensuring proper ordering and labeling, factors elevate the impact of graphical representations, making categorical data come alive in meaningful visual stories.</p>
<p>As we conclude our guide to factors in R, we find ourselves equipped with a toolkit to navigate the categorical maze. Whether you’re a seasoned data scientist or an aspiring analyst, embracing factors unlocks a deeper understanding of your data, paving the way for more accurate analyses, clearer visualizations, and robust statistical models.</p>
<p>Cracking the code of categorical data is not merely a technical feat—it’s an art. Factors, in their simplicity and versatility, empower us to decode the richness embedded in categorical variables, turning what might seem like a labyrinth into a comprehensible landscape of insights. So, let the journey with factors in R be your compass, guiding you through the intricate tapestry of categorical data analysis. Happy coding!</p>


</section>

 ]]></description>
  <category>R Programming</category>
  <category>data types</category>
  <category>factor</category>
  <category>categorical data</category>
  <guid>https://mfatihtuzen.github.io/posts/2024-01-11_factors/</guid>
  <pubDate>Thu, 11 Jan 2024 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
