@@ -216,7 +216,7 @@ <h2 class="anchored" data-anchor-id="scenario-auditing-a-partner-dataset">Scenar
216216< section id ="step-1-efficiently-loading-a-large-csv " class ="level2 ">
217217< h2 class ="anchored " data-anchor-id ="step-1-efficiently-loading-a-large-csv "> Step 1: Efficiently Loading a Large CSV</ h2 >
218218< p > Instead of calling < code > pandas.read_csv()</ code > directly, you can use < code > load_optimized_csv()</ code > from < strong > csvplus</ strong > , which reduces memory usage automatically.</ p >
219- < div id ="73856fcf " class ="cell " data-execution_count ="1 ">
219+ < div id ="9cea5837 " class ="cell " data-execution_count ="1 ">
220220< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb1 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb1-1 "> < a href ="#cb1-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> from</ span > csvplus.load_optimized_csv < span class ="im "> import</ span > load_optimized_csv</ span >
221221< span id ="cb1-2 "> < a href ="#cb1-2 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
222222< span id ="cb1-3 "> < a href ="#cb1-3 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="co "> # Load a large csv dataset</ span > </ span >
@@ -243,7 +243,7 @@ <h3 class="anchored" data-anchor-id="what-this-function-does">What this function
243243< section id ="step-2-resolving-inconsistent-string-values " class ="level2 ">
244244< h2 class ="anchored " data-anchor-id ="step-2-resolving-inconsistent-string-values "> Step 2: Resolving Inconsistent String Values</ h2 >
245245< p > Text fields such as company names are often inconsistent due to typos or formatting differences. The < code > resolve_string_value()</ code > function uses fuzzy matching to standardize values.</ p >
246- < div id ="60265d52 " class ="cell " data-execution_count ="2 ">
246+ < div id ="0eb8d769 " class ="cell " data-execution_count ="2 ">
247247< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb2 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb2-1 "> < a href ="#cb2-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> import</ span > pandas < span class ="im "> as</ span > pd</ span >
248248< span id ="cb2-2 "> < a href ="#cb2-2 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> from</ span > csvplus.data_correction < span class ="im "> import</ span > resolve_string_value</ span >
249249< span id ="cb2-3 "> < a href ="#cb2-3 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
@@ -278,7 +278,7 @@ <h2 class="anchored" data-anchor-id="step-2-resolving-inconsistent-string-values
278278< section id ="step-3-generating-a-data-summary-report " class ="level2 ">
279279< h2 class ="anchored " data-anchor-id ="step-3-generating-a-data-summary-report "> Step 3: Generating a Data Summary Report</ h2 >
280280< p > Before performing deeper analysis, it is often helpful to understand the structure and quality of the dataset.</ p >
281- < div id ="bfdd8fdc " class ="cell " data-execution_count ="3 ">
281+ < div id ="ff2c9a34 " class ="cell " data-execution_count ="3 ">
282282< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb4 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb4-1 "> < a href ="#cb4-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> import</ span > pandas < span class ="im "> as</ span > pd</ span >
283283< span id ="cb4-2 "> < a href ="#cb4-2 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> from</ span > csvplus.generate_report < span class ="im "> import</ span > summary_report</ span >
284284< span id ="cb4-3 "> < a href ="#cb4-3 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
@@ -338,7 +338,7 @@ <h2 class="anchored" data-anchor-id="step-3-generating-a-data-summary-report">St
338338< li > Confidence intervals</ li >
339339</ ul >
340340< p > To inspect categorical columns:</ p >
341- < div id ="94253df7 " class ="cell " data-execution_count ="4 ">
341+ < div id ="ed4459eb " class ="cell " data-execution_count ="4 ">
342342< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb5 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb5-1 "> < a href ="#cb5-1 " aria-hidden ="true " tabindex ="-1 "> </ a > categorical_stats.loc[< span class ="st "> 'city'</ span > , < span class ="st "> 'n_unique'</ span > ]</ span >
343343< span id ="cb5-2 "> < a href ="#cb5-2 " aria-hidden ="true " tabindex ="-1 "> </ a > categorical_stats.loc[< span class ="st "> 'city'</ span > , < span class ="st "> 'n_unique'</ span > ]</ span >
344344< span id ="cb5-3 "> < a href ="#cb5-3 " aria-hidden ="true " tabindex ="-1 "> </ a > categorical_stats.loc[< span class ="st "> 'city'</ span > , < span class ="st "> 'top_values'</ span > ]</ span >
@@ -352,7 +352,7 @@ <h2 class="anchored" data-anchor-id="step-3-generating-a-data-summary-report">St
352352< section id ="step-4-comparing-dataset-versions " class ="level2 ">
353353< h2 class ="anchored " data-anchor-id ="step-4-comparing-dataset-versions "> Step 4: Comparing Dataset Versions</ h2 >
354354< p > A week later, you receive an updated version of the original CSV file, possibly with schema and data changes. You want to understand < strong > what changed</ strong > relative to the original dataset. Once both versions are loaded, the < code > data_version_diff()</ code > function computes a structured comparison between the two DataFrames.</ p >
355- < div id ="3577fc11 " class ="cell " data-execution_count ="5 ">
355+ < div id ="297e6e10 " class ="cell " data-execution_count ="5 ">
356356< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb7 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb7-1 "> < a href ="#cb7-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> import</ span > pandas < span class ="im "> as</ span > pd</ span >
357357< span id ="cb7-2 "> < a href ="#cb7-2 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> from</ span > csvplus.data_version_diff < span class ="im "> import</ span > data_version_diff</ span >
358358< span id ="cb7-3 "> < a href ="#cb7-3 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
@@ -408,19 +408,19 @@ <h2 class="anchored" data-anchor-id="step-4-comparing-dataset-versions">Step 4:
408408< section id ="step-5-inspecting-dataframe-changes-programmatically " class ="level2 ">
409409< h2 class ="anchored " data-anchor-id ="step-5-inspecting-dataframe-changes-programmatically "> Step 5: Inspecting Dataframe Changes Programmatically</ h2 >
410410< p > You can explore specific components of the diff object directly:</ p >
411- < div id ="cfed0b8f " class ="cell " data-execution_count ="6 ">
411+ < div id ="490aeee8 " class ="cell " data-execution_count ="6 ">
412412< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb9 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb9-1 "> < a href ="#cb9-1 " aria-hidden ="true " tabindex ="-1 "> </ a > diff[< span class ="st "> "columns_added"</ span > ]</ span > </ code > </ pre > </ div > < button title ="Copy to Clipboard " class ="code-copy-button "> < i class ="bi "> </ i > </ button > </ div >
413413< div class ="cell-output cell-output-display " data-execution_count ="5 ">
414414< pre > < code > ['amount', 'category']</ code > </ pre >
415415</ div >
416416</ div >
417- < div id ="650fd399 " class ="cell " data-execution_count ="7 ">
417+ < div id ="4f961608 " class ="cell " data-execution_count ="7 ">
418418< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb11 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb11-1 "> < a href ="#cb11-1 " aria-hidden ="true " tabindex ="-1 "> </ a > diff[< span class ="st "> "row_count_change"</ span > ]</ span > </ code > </ pre > </ div > < button title ="Copy to Clipboard " class ="code-copy-button "> < i class ="bi "> </ i > </ button > </ div >
419419< div class ="cell-output cell-output-display " data-execution_count ="6 ">
420420< pre > < code > {'old_row_count': 3, 'new_row_count': 4, 'row_difference': 1}</ code > </ pre >
421421</ div >
422422</ div >
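The two outputs above have a simple shape, and a plain-pandas sketch may help show what they describe. The helper name `sketch_structure_diff` and the toy DataFrames are assumptions made up for illustration. This is not how csvplus implements `data_version_diff()`; only the dictionary keys are taken from the outputs shown above.

```python
import pandas as pd


def sketch_structure_diff(old: pd.DataFrame, new: pd.DataFrame) -> dict:
    """Hypothetical helper (not part of csvplus): column and row-count changes."""
    old_cols = set(old.columns)
    return {
        # Columns present in the new frame but not the old one, in new-frame order
        "columns_added": [c for c in new.columns if c not in old_cols],
        "row_count_change": {
            "old_row_count": len(old),
            "new_row_count": len(new),
            "row_difference": len(new) - len(old),
        },
    }


# Toy data invented for this sketch
old = pd.DataFrame({"id": [1, 2, 3]})
new = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "amount": [10.0, 20.0, 30.0, 40.0],
        "category": ["a", "b", "a", "c"],
    }
)

result = sketch_structure_diff(old, new)
print(result["columns_added"])
print(result["row_count_change"])
```

Iterating over `new.columns` rather than taking a set difference keeps the added columns in their original order, which makes the result easier to compare against the file.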
423- < div id ="c1dee536 " class ="cell " data-execution_count ="8 ">
423+ < div id ="be427c76 " class ="cell " data-execution_count ="8 ">
424424< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb13 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb13-1 "> < a href ="#cb13-1 " aria-hidden ="true " tabindex ="-1 "> </ a > diff[< span class ="st "> "missing_value_changes"</ span > ]</ span > </ code > </ pre > </ div > < button title ="Copy to Clipboard " class ="code-copy-button "> < i class ="bi "> </ i > </ button > </ div >
425425< div class ="cell-output cell-output-display " data-execution_count ="7 ">
426426< div >
@@ -457,7 +457,7 @@ <h2 class="anchored" data-anchor-id="step-5-inspecting-dataframe-changes-program
457457</ div >
458458</ div >
459459</ div >
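For intuition, a per-column comparison of missing values can be approximated with plain pandas. This is a minimal sketch under stated assumptions: the helper name `sketch_missing_value_changes`, the output column names, and the toy data are all hypothetical, and the table csvplus returns for `"missing_value_changes"` may be laid out differently.

```python
import pandas as pd


def sketch_missing_value_changes(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of csvplus): missing counts per shared column."""
    # Only columns present in both versions can be compared directly
    shared = [c for c in old.columns if c in new.columns]
    old_missing = old[shared].isna().sum()
    new_missing = new[shared].isna().sum()
    return pd.DataFrame(
        {
            "old_missing": old_missing,
            "new_missing": new_missing,
            "difference": new_missing - old_missing,
        }
    )


# Toy data invented for this sketch
old = pd.DataFrame({"id": [1, 2, 3], "city": ["A", None, "C"]})
new = pd.DataFrame({"id": [1, 2, 3, 4], "city": ["A", "B", None, None]})

changes = sketch_missing_value_changes(old, new)
print(changes)
```

A positive `difference` means the newer file has more missing entries in that column than the original.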
460- < div id ="d8847202 " class ="cell " data-execution_count ="9 ">
460+ < div id ="eb4381e1 " class ="cell " data-execution_count ="9 ">
461461< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb14 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb14-1 "> < a href ="#cb14-1 " aria-hidden ="true " tabindex ="-1 "> </ a > diff[< span class ="st "> "numeric_summary_changes"</ span > ]</ span > </ code > </ pre > </ div > < button title ="Copy to Clipboard " class ="code-copy-button "> < i class ="bi "> </ i > </ button > </ div >
462462< div class ="cell-output cell-output-display " data-execution_count ="8 ">
463463< div >
@@ -518,7 +518,7 @@ <h2 class="anchored" data-anchor-id="step-5-inspecting-dataframe-changes-program
518518< section id ="step-6-displaying-a-human-readable-report " class ="level2 ">
519519< h2 class ="anchored " data-anchor-id ="step-6-displaying-a-human-readable-report "> Step 6: Displaying a Human-Readable Report</ h2 >
520520< p > For interactive use, < strong > csvplus</ strong > provides a console-friendly summary of the DataFrame changes inspected in Step 5.</ p >
521- < div id ="7721ba0a " class ="cell " data-execution_count ="10 ">
521+ < div id ="1e29af5b " class ="cell " data-execution_count ="10 ">
522522< div class ="code-copy-outer-scaffold "> < div class ="sourceCode cell-code " id ="cb15 "> < pre class ="sourceCode python code-with-copy "> < code class ="sourceCode python "> < span id ="cb15-1 "> < a href ="#cb15-1 " aria-hidden ="true " tabindex ="-1 "> </ a > < span class ="im "> from</ span > csvplus.data_version_diff < span class ="im "> import</ span > display_data_version_diff</ span >
523523< span id ="cb15-2 "> < a href ="#cb15-2 " aria-hidden ="true " tabindex ="-1 "> </ a > </ span >
524524< span id ="cb15-3 "> < a href ="#cb15-3 " aria-hidden ="true " tabindex ="-1 "> </ a > display_data_version_diff(diff)</ span > </ code > </ pre > </ div > < button title ="Copy to Clipboard " class ="code-copy-button "> < i class ="bi "> </ i > </ button > </ div >