通过与另一个DataFrame进行比较来筛选DataFrame

huangapple go评论55阅读模式
英文:

Filtering a DataFrame by comparing to another DataFrame

问题

I'm very new to using Python, and have been working on a script that will allow me to download a csv, clean the data and create an output file that can be imported correctly into a Shopify store.

So far, I've managed to accomplish this, generating a daily import file that contains the relevant data to import into my shop.

What I want to do to further enhance the code, is have the output file filtered to only include rows where there has been a change. In other words, if a product's cost or price changes, or it comes out of pre-order status, I want it to be included, but if the items' lines are the same in both the product list and the latest stockfeed it should be excluded from the import.

Update: I've been working on this for the last few days, and I think I'm closer but I'm still not able to get that last step out of the way.

I've gotten to the point that I've created two DataFrames for the comparison: df_merged and df_prod_filtered.

I've generated the following from the script for troubleshooting purposes:

df_merged
Columns:  ['ID', 'Handle', 'Variant Compare At Price', 'Template Suffix', 'Variant Inventory Policy', 'Variant Cost']
Index column of df_merged:  None
Number of rows:  1802
Column types:
 ID                            int64
Handle                       object
Variant Compare At Price    float64
Template Suffix              object
Variant Inventory Policy     object
Variant Cost                float64
dtype: object

df_prod_filtered
Columns:  ['ID', 'Handle', 'Variant Compare At Price', 'Template Suffix', 'Variant Inventory Policy', 'Variant Cost']
Index column of df_prod_filtered:  None
Number of rows:  1802
Column types:
 ID                            int64
Handle                       object
Variant Compare At Price    float64
Template Suffix              object
Variant Inventory Policy     object
Variant Cost                float64
dtype: object 

As is clear, the DataFrames are structured the same, and have the same number of rows, having moved the function to after the removal of "Discontinued" Items.

My latest attempt was to use a mask comparing the two DataFrames.

# create a mask by comparing the two dataframes based on their index (ID column)
mask = df_merged.eq(df_prod_filtered.loc[df_merged.index])

# create a new column in df_merged that shows if there is a difference or not
df_merged['Diff'] = ~mask.all(axis=1)

However, this resulted in all of the rows being marked as the same (no differences in any cell).

To confirm this wasn't the case, I manually edited several cells in df_prod_filtered and repeated the process, getting another list with no apparent differences.

So this is where I'm stuck again. I need to compare two Dataframes, now called df_merged and df_prod_filtered, using their ID column as an Index. I need to either merge or generate a new dataframe, that only contains rows where the data from df_merged is different to the data in df_prod_filtered, and it needs to take the new row data from df_merged.

For example, if df_merged had the following:

ID,Handle,Variant Compare At Price,Template Suffix,Variant Inventory Policy,Variant Cost
6548971520023,128194,49.99,,deny,32.49
6548974206999,128356,9.99,,deny,6.49

and df_prod_filtered had

ID,Handle,Variant Compare At Price,Template Suffix,Variant Inventory Policy,Variant Cost
6548971520023,128194,49.99,,deny,32.49
6548974206999,128356,8.99,,deny,6.29

The New DataFrame should only contain

ID,Handle,Variant Compare At Price,Template Suffix,Variant Inventory Policy,Variant Cost
6548974206999,128356,9.99,,deny,6.49

Now given my novice nature with Python and Pandas, it's possible that my issue is elsewhere in the script, for example in how df_prod_filtered or df_merged is generated. So Here is the entire script with links redacted for reference:

import pandas as pd
import datetime
import requests
import re
import os

# Set filename as current date
now = datetime.datetime.now()
filename = now.strftime("%B%d") + '.csv';

# Define a function to remove the word 'Brick' from the title
def remove_brick(title):
    if 'Booster' in title and 'Brick' in title:
        return title.replace('Brick', '').strip()
    else:
        return title
        
# Clear existing file        
if os.path.exists('stockfeed.csv'):
    os.remove('stockfeed.csv')
    
# Download today's stockfeed.
url = '<

<details>
<summary>英文:</summary>

I&#39;m very new to using Python, and have been working on a script that will allow me to download a csv, clean the data and create an output file that can be imported correctly into a shopify store. 

So far, I&#39;ve managed to accomplish this, generating a daily import file that contains the relevant data to import into my shop. 

What I want to do to further enhance the code, is have the output file filtered to only include rows where there has been a change. In other words, if a products cost or price changes, or it comes out of pre-order status, I want it to be included, but if the items lines are the same in both the product list and the latest stockfeed it should be excluded from the import. 

**Update**: I&#39;ve been working on this for the last few days, and I think I&#39;m closer but I&#39;m still not able to get that last step out of the way. 

I&#39;ve gotten to the point that I&#39;ve created two DataFrames for the comparison: **df_merged** and **df_prod_filtered**. 

I&#39;ve generated the following from the script for troubleshooting proposes: 

```df_merged
Columns:  [&#39;ID&#39;, &#39;Handle&#39;, &#39;Variant Compare At Price&#39;, &#39;Template Suffix&#39;, &#39;Variant Inventory Policy&#39;, &#39;Variant Cost&#39;]
Index column of df_merged:  None
Number of rows:  1802
Column types:
 ID                            int64
Handle                       object
Variant Compare At Price    float64
Template Suffix              object
Variant Inventory Policy     object
Variant Cost                float64
dtype: object

df_prod_filtered
Columns:  [&#39;ID&#39;, &#39;Handle&#39;, &#39;Variant Compare At Price&#39;, &#39;Template Suffix&#39;, &#39;Variant Inventory Policy&#39;, &#39;Variant Cost&#39;]
Index column of df_prod_filtered:  None
Number of rows:  1802
Column types:
 ID                            int64
Handle                       object
Variant Compare At Price    float64
Template Suffix              object
Variant Inventory Policy     object
Variant Cost                float64
dtype: object 

As is clear, the DataFrames are structured the same, and have the same number of rows, having moved the function to after the removal of "Discontinued" Items.

My latest attempt was to use a mask comparing the two DataFrames.

# create a mask by comparing the two dataframes based on their index (ID column)
mask = df_merged.eq(df_prod_filtered.loc[df_merged.index])
# create a new column in df_merged that shows if there is a difference or not
df_merged[&#39;Diff&#39;] = ~mask.all(axis=1)

However, this resulted in all of the rows being marked as the same (no differences in any cell).

To confirm this wasn't the case, I manually edited several cells in df_prod_filtered and repeated the process, getting another list with no apparent differences.

So this is where I'm stuck again. I need to compare two Dataframes, now called df_merged and df_prod_filtered, using there ID column as an Index. I need to either merge or generate a new dataframe, that only contains rows where the data from df_merged is different to the data in df_prod_filtered, and it needs to take the new row data from df_merged.

For example, if df_merged had the following:

ID,Handle,Variant Compare At Price,Template Suffix,Variant Inventory Policy,Variant Cost
6548971520023,128194,49.99,,deny,32.49
6548974206999,128356,9.99,,deny,6.49

and df_prod_filtered had

ID,Handle,Variant Compare At Price,Template Suffix,Variant Inventory Policy,Variant Cost
6548971520023,128194,49.99,,deny,32.49
6548974206999,128356,8.99,,deny,6.29

The New DataFrame should only contain

ID,Handle,Variant Compare At Price,Template Suffix,Variant Inventory Policy,Variant Cost
6548974206999,128356,9.99,,deny,6.49

Now given my novice nature with Python and Pandas, its possible that my issue is elsewhere in the script, for example in how df_prod_filtered or df_merged is generated. So Here is the entire script with links redacted for reference:

import pandas as pd
import datetime
import requests
import re
import os
# Set filename as current date
now = datetime.datetime.now()
filename = now.strftime(&quot;%B%d&quot;) + &#39;.csv&#39;
# Define a function to remove the word &#39;Brick&#39; from the title
def remove_brick(title):
if &#39;Booster&#39; in title and &#39;Brick&#39; in title:
return title.replace(&#39;Brick&#39;, &#39;&#39;).strip()
else:
return title
# Clear existing file        
if os.path.exists(&#39;stockfeed.csv&#39;):
os.remove(&#39;stockfeed.csv&#39;)
# Download todays stockfeed.
url = &#39;&lt;CSV URL Goes HERE&gt;&#39;
response = requests.get(url)
if response.status_code == 200:
with open(&#39;stockfeed.csv&#39;, &#39;wb&#39;) as f:
f.write(response.content)
else:
print(&#39;Failed to download CSV file&#39;)
# Load the CSV file into a pandas DataFrame
df_a = pd.read_csv(&#39;stockfeed.csv&#39;)
# Keep only the required columns
df_a = df_a[[&#39;item_number&#39;, &#39;name&#39;, &#39;image_path&#39;, &#39;rrp&#39;, &#39;description&#39;, &#39;barcode&#39;, &#39;manufacturer_sku&#39;, &#39;availability&#39;, &#39;publisher&#39;, &#39;price_ex_gst&#39;, &#39;item_group&#39;, &#39;board_game_genre&#39;, &#39;weight_kg&#39;, &#39;game_family&#39;]]
# Rename the columns
df_a = df_a.rename(columns={&#39;item_number&#39;: &#39;Handle&#39;, &#39;name&#39;: &#39;Title&#39;,&#39;image_path&#39;: &#39;Image Src&#39;, &#39;description&#39;: &#39;Body HTML&#39;, &#39;manufacturer_sku&#39;: &#39;Variant SKU&#39;,&#39;barcode&#39;: &#39;Variant Barcode&#39;, &#39;weight_kg&#39;: &#39;Variant Weight&#39;, &#39;price_ex_gst&#39;: &#39;Variant Cost&#39;, &#39;publisher&#39;: &#39;Vendor&#39;})
# Filter the rows based on criteria
df_a = df_a[(df_a[&#39;availability&#39;] == &#39;Pre-Order&#39;)
&amp; (~df_a[&#39;item_group&#39;].isin([&#39;Board Games&#39;, &#39;CCG&#39;, &#39;Puzzles&#39;]))
&amp; (~df_a[&#39;game_family&#39;].isin([&#39;Traveller&#39;, &#39;Keyforge&#39;, &#39;SLA Industries&#39;, &#39;Successors&#39;, &#39;Dungeon Crawl&#39;, &#39;Judge Dredd&#39;, &#39;KULT RPG&#39;, &#39;War of the Ring&#39;, &#39;Rocketmen&#39;, &#39;Great Wyrms of Draka&#39;, &#39;Epic Card Game&#39;, &#39;13th Age RPG&#39;, &#39;A Game of Thrones A Song of Ice and Fire&#39;, &#39;A Song of Ice and Fire&#39;, &#39;Achtung Cthulhu&#39;, &#39;Achtung Cthulhu 2d20&#39;, &#39;Achtung! Cthulhu Miniatures&#39;, &#39;Adventures &amp; Academia&#39;, &#39;Alien RPG&#39;, &#39;Ashen Stars RPG&#39;, &#39;Avatar Legends&#39;, &#39;Battletech&#39;, &#39;BeyBlade&#39;, &#39;Bicycle&#39;, &#39;Black Void RPG&#39;, &#39;Blue Rose RPG&#39;, &#39;Boss Monster&#39;, &#39;Call to Adventure&#39;, &#39;Castles and Crusades RPG&#39;, &#39;Conan RPG&#39;, &#39;Corolis RPG&#39;, &#39;Cypher&#39;, &#39;Dark Souls&#39;, &#39;Digimon Card Game&#39;, &#39;Doctor Who&#39;, &#39;Divinity&#39;, &#39;Disney&#39;, &#39;Dungeon Crawl Classics&#39;, &#39;Dungeonology&#39;, &#39;Elder Scrolls Call to Arms&#39;, &#39;Fallout RPG&#39;, &#39;Fallout Wasteland Warfare&#39;, &#39;Fantasy AGE&#39;, &#39;Fear Itself RPG&#39;, &#39;Fire &amp; Stone&#39;, &#39;Folklore&#39;, &#39;Forbidden Lands RPG&#39;, &#39;Forbidden Lands&#39;, &#39;Fragged Empire RPG&#39;, &#39;G.I. Joe&#39;, &#39;GameMastery&#39;, &#39;Galaxy Defenders&#39;, &#39;Gatekeeper Dice&#39;, &#39;Halfsies Dice&#39;, &#39;Hero Realms&#39;, &#39;Heroclix&#39;, &#39;Homeworld Revelations RPG&#39;, &#39;Hunter: The Reckoning&#39;, &#39;Infinity RPG&#39;, &#39;John Carter of Mars RPG&#39;, &#39;Jack Vance RPG&#39;, &#39;Invisible Sun&#39;, &#39;Kem Arrow&#39;, &#39;Kids on Bikes&#39;, &#39;Knights of the Round&#39;, &#39;Kobolds Ate My Baby&#39;, &#39;Lamentations RPG&#39;, &#39;Last Aurora&#39;, &#39;Liminal RPG&#39;, &#39;Lord of the Rings RPG&#39;, &#39;Masks&#39;, &#39;Metamorphosis Alpha&#39;, &#39;Modern Age RPG&#39;, &#39;Mutant City Blues RPG&#39;, &#39;Mork Borg RPG&#39;, &#39;Mutant Crawl Classics&#39;, &#39;Mutant Year Zero RPG&#39;, &#39;Mutants &amp; Masterminds&#39;, &#39;My Little Pony&#39;, &#39;Mythos RPG&#39;, &#39;Nerf&#39;, &#39;Night\&#39;s Black Agents RPG&#39;, &#39;Numenera&#39;, &#39;Odyssey of the Dragonlords RPG&#39;, &#39;One Piece&#39;, &#39;Original Adventures Reincarnated&#39;, &#39;Overlight&#39;, &#39;Paladins of the Western Kingdoms&#39;, &#39;Paradox Initiative&#39;, &#39;Pasion de las Pasiones&#39;, &#39;Pirate Borg&#39;, &#39;Planegea RPG&#39;, &#39;PolyHero&#39;, &#39;Power Rangers&#39;, &#39;Robin Laws RPG&#39;, &#39;Rocketmen&#39;, &#39;Ruins of Symbaroum&#39;, &#39;Ruins of Symbaroum RPG&#39;, &#39;Shadowrun&#39;, &#39;Shadows of Brimstone&#39;, &#39;Sorcerer&#39;, &#39;Sorcerer\&#39;s Arena&#39;, &#39;Spirograph&#39;, &#39;Sprue Wave 2&#39;, &#39;Star Realms&#39;, &#39;Star Trek Adventures&#39;, &#39;Star Wars X Wing&#39;, &#39;Starfinder&#39;, &#39;Stargate SG-1 RPG&#39;, &#39;Symbaroum RPG&#39;, &#39;Symbaroum&#39;, &#39;Tales from the Loop RPG&#39;, &#39;Symbaroum RPG - Thistle Hold&#39;, &#39;Tally Ho&#39;, &#39;The Art of&#39;, &#39;The Borellus&#39;, &#39;The Esoterrorists RPG&#39;, &#39;The Excellents RPG&#39;, &#39;The Fantasy Trip&#39;, &#39;The Lost Citadel RPG&#39;, &#39;The Spy Game RPG&#39;, &#39;The One Ring RPG&#39;, &#39;The Strange&#39;, &#39;The Yellow King RPG&#39;, &#39;Things from the Flood RPG&#39;, &#39;Trail of Cthulhu RPG&#39;, &#39;TimeWatch RPG&#39;, &#39;Transformers&#39;, &#39;Tripods &amp; Triplanes&#39;, &#39;Twilight 2000&#39;, &#39;U-Boot&#39;, &#39;Upzone&#39;, &#39;Vaesen Nordic Horror&#39;, &#39;Vornheim RPG The Complete City&#39;, &#39;Vurt RPG&#39;, &#39;Warhammer Fantasy Roleplay&#39;, &#39;World of Tanks&#39;, &#39;World War Cthulhu&#39;, &#39;Yggdrasil&#39;]))
&amp; (~df_a[&#39;Image Src&#39;].isin([&#39;https://letsplaygames.com.au/media//catalog/product/placeholder/default/Placeholder_Image-_LPG_Transparent.png&#39;]))
&amp; (~df_a[&#39;Title&#39;].str.contains(&#39;Coriolis|Power Rangers|Starfinder|Counter|Dice Cups|Dice Bag|Marvel|Homeworld Revelations|LUGU|Card Game|Essence20 Roleplaying System|Infinity Collectible|Class Deck|Castle Falkenstein|Transformers|LPG|Stand with Paints|G.I. Joe|Haunted West|Playing Cards|Metal Coasters|Teenagers From Outer Space|Mekton Zeta|Core Fuzion|Display|Poster|Everyday Heroes RPG&#39;, case=False))
&amp; (~df_a[&#39;Vendor&#39;].isin([&#39;Wizards of the Coast&#39;, &#39;Arcane Tinmen&#39;, &#39;Steve Jackson Games&#39;, &#39;Ultra Pro&#39;, &#39;Akora Cards&#39;, &#39;Word Forge Games&#39;, &#39;Ultimate Guard&#39;, &#39;The Op&#39;, &#39;Studio 9 Games&#39;, &#39;Steamforged Games&#39;, &#39;Rebellion Unplugged&#39;, &#39;Plaid Hat Games&#39;, &#39;Pinfinity&#39;, &#39;Pelgrane Press&#39;, &#39;Monte Cook Games&#39;, &#39;Lynnvander Studios&#39;, &#39;Mantic Games&#39;, &#39;Loke BattleMats&#39;, &#39;Lamentations of the Flame&#39;, &#39;Goodman Games&#39;, &#39;Green Ronin Publishing&#39;, &#39;Funko&#39;, &#39;Gamelyn Games&#39;, &#39;Fantasy Flight Games&#39;, &#39;Edge Studios&#39;, &#39;Darrington Press&#39;, &#39;CMON&#39;, &#39;Chaosium&#39;, &#39;Battle Systems&#39;, &#39;Black Site Studios&#39;, &#39;Bandai&#39;, &#39;Atomic Overmind Press&#39;, &#39;Atomic Mass Games&#39;, &#39;Atlas Games&#39;, &#39;Archon Studio&#39;, &#39;Archon Games&#39;, &#39;9th Level Games&#39;, &#39;Modiphius Entertainment&#39;, &#39;Troll Lord Games&#39;, &#39;&#39;,&#39;Two Little Mice&#39;, &#39;Dark Horse Books&#39;]))]
df_a = df_a[~((df_a[&#39;Vendor&#39;].isna()) &amp; (df_a[&#39;Variant SKU&#39;].str.contains(&#39;BPG0&#39;)))]
# Generate New Columns
df_a[&#39;Variant Compare At Price&#39;] = round(df_a[&#39;rrp&#39;] / 1.1, 4)
df_a[&#39;Variant Price&#39;] = df_a[&#39;Variant Compare At Price&#39;]
df_a[&#39;Template Suffix&#39;] = df_a[&#39;availability&#39;].apply(lambda x: &#39;pre-order&#39; if x == &#39;Pre-Order&#39; else &#39;&#39;)
df_a[&#39;Variant Inventory Policy&#39;] = df_a[&#39;availability&#39;].apply(lambda x: &#39;continue&#39; if x == &#39;Pre-Order&#39; else &#39;deny&#39;)
df_a[&#39;Type&#39;] = &#39;&#39;
df_a[&#39;Tags&#39;] = &#39;&#39;
df_a[&#39;Variant Weight Unit&#39;] = &#39;kg&#39;
df_a[&#39;Variant Taxable&#39;] = &#39;TRUE&#39;
df_a[&#39;Variant Requires Shipping&#39;] = &#39;TRUE&#39;
df_a[&#39;Variant Inventory Tracker&#39;] = &#39;shopify&#39;
# Set Product Keywords
rpgkey = [&#39;Sourcebook&#39;, &#39;Kobold Press&#39;, &#39;Adventure Path&#39;, &#39;Campaign Setting&#39;, &#39;Pathfinder First Edition&#39;, &#39;Pathfinder Second Edition&#39;, &#39;Cyberpunk 2020&#39;, &#39;Cyberpunk RED&#39;,&#39;Vampire: The Masquarade&#39;]
paintkey = [&#39;AK Interactive Auxiliaries&#39;, &#39;Primers&#39;, &#39;Metallics&#39;, &#39;Tones&#39;, &#39;Primer&#39;, &#39;Paint Thinner&#39;, &#39;Acrylic&#39;, &#39;Acrylics&#39;, &#39;Paint Stripper&#39;, &#39;Paint Set&#39;, &#39;Speedpaint&#39;, &#39;3Gen Sets&#39;, &#39;Dual Exo Set&#39;, &#39;Pigments&#39;, &#39;Aerosol&#39;, &#39;Colour Set&#39;, &#39;Color Set&#39;]
miniaturekey = [&#39;D&amp;D Classic Collection&#39;, &#39;Icons of the Realms&#39;, &#39;Action Figure&#39;, &#39;Miniatures&#39;,&#39;Idols of the Realms&#39;, &#39;Miniature&#39;, &#39;Pathfinder Battles&#39;, &#39;D&amp;D Frameworks&#39;, &#39;TinkerTurf&#39;, &#39;Wizkids Dungeon Dressings&#39;, &#39;Deep Cuts&#39;]
colectablekey = [&#39;Medallion&#39;, &#39;Life-Sized&#39;, &#39;Replicas of the Realms&#39;, &#39;Trophy&#39;, &#39;Plaque&#39;, &#39;Dicelings&#39;, &#39;D&amp;D Cartoon Classics&#39;, &#39;Collectibles&#39;,&#39;Collectible&#39;, &#39;Collectables&#39;, &#39;Collectable&#39;, &#39;Statue&#39;, &#39;Plush&#39;]
accessorykey = [&#39;Mat&#39;, &#39;Playmat&#39;, &#39;Paintbrush&#39;, &#39;Brush&#39;, &#39;Tool&#39;, &#39;Tray&#39;, &#39;Palette&#39;, &#39;Glue&#39;, &#39;Battlemat&#39;, &#39;Megamat&#39;, &#39;Brush Set&#39;, &#39;Insert&#39;, &#39;Drybrush&#39;, &#39;Folio&#39;, &#39;Pathfinder Accessories&#39;, &#39;Flip-Mat&#39;, &#39;Dry Brushes&#39;, &#39;Marker&#39;, &#39;Sheets&#39;, &#39;Brushes Set&#39;] 
dicekey = [&#39;7-Die&#39;, &#39;D20&#39;, &#39;D12&#39;, &#39;D10&#39;, &#39;D100&#39;, &#39;D4&#39;, &#39;d6&#39;, &#39;Dice Set&#39;, &#39;Dice&#39;] 
pbkey = [&#39;Brush&#39;, &#39;PaintBrush&#39;]
monkey = [&#39;Monument Hobbies&#39;]
mlkey = [&#39;ml&#39;]
# Set Product Types 
df_a.loc[df_a[&#39;Title&#39;].str.contains(&#39;|&#39;.join([rf&#39;\b{re.escape(kw)}\b&#39; for kw in paintkey]), case=False), &#39;Type&#39;] = &#39;Paint&#39;
df_a.loc[df_a[&#39;Title&#39;].str.contains(&#39;|&#39;.join([rf&#39;\b{re.escape(kw)}\b&#39; for kw in rpgkey]), case=False), &#39;Type&#39;] = &#39;RPG Book&#39;
df_a.loc[df_a[&#39;Title&#39;].str.contains(&#39;|&#39;.join([rf&#39;\b{re.escape(kw)}\b&#39; for kw in miniaturekey]), case=False), &#39;Type&#39;] = &#39;Miniature&#39;
df_a.loc[df_a[&#39;Title&#39;].str.contains(&#39;|&#39;.join([rf&#39;\b{re.escape(kw)}\b&#39; for kw in colectablekey]), case=False), &#39;Type&#39;] = &#39;Collectable&#39;
df_a.loc[df_a[&#39;Title&#39;].str.contains(&#39;|&#39;.join([rf&#39;\b{re.escape(kw)}\b&#39; for kw in accessorykey]), case=False), &#39;Type&#39;] = &#39;Accessory&#39;
df_a.loc[df_a[&#39;Title&#39;].str.contains(&#39;|&#39;.join([rf&#39;\b{re.escape(kw)}\b&#39; for kw in dicekey]), case=False), &#39;Type&#39;] = &#39;Dice&#39;
df_a.loc[(df_a[&#39;Title&#39;].str.contains(&#39;Critical Role&#39;)) &amp; (df_a[&#39;Title&#39;].str.contains(&#39;Boxed Set&#39;)), &#39;Type&#39;] = &#39;Miniature&#39;
df_a.loc[df_a[&#39;Vendor&#39;] == &#39;Monument Hobbies&#39;, &#39;Type&#39;] = &#39;Accessory&#39;
df_a.loc[df_a[&#39;Title&#39;].str.contains(&#39;|&#39;.join(mlkey), case=False), &#39;Type&#39;] = &#39;Paint&#39;
# Remove invalid product types. 
df_a = df_a.dropna(subset=[&#39;Type&#39;])
#Define paint colors
paint_colors =  [&#39;Black&#39;, &#39;Blue&#39;, &#39;Brown&#39;, &#39;Green&#39;, &#39;Grey&#39;, &#39;Orange&#39;, &#39;Purple&#39;, &#39;Red&#39;, &#39;Yellow&#39;, &#39;Wood&#39;, &#39;Marble&#39;, &#39;Ashes&#39;, &#39;Dirt&#39;, &#39;Earth&#39;, &#39;Soil&#39;, &#39;Dust&#39;, &#39;Flesh&#39;, &#39;Topaz&#39;, &#39;Rust&#39;, &#39;Acid&#39;, &#39;Magic&#39;, &#39;Arcane&#39;, &#39;Frost&#39;, &#39;Blood&#39;, &#39;Moss&#39;, &#39;Varnish&#39;, &#39;Wash&#39;, &#39;Thinner&#39;, &#39;Satin&#39;, &#39;Umber&#39;, &#39;Sepia&#39;, &#39;Violet&#39;, &#39;Ink&#39;, &#39;Fluorescent&#39;, &#39;Magenta&#39;, &#39;White&#39;, &#39;Skin&#39;, &#39;Brass&#39;, &#39;Gold&#39;, &#39;Bronze&#39;, &#39;Copper&#39;, &#39;Mystic&#39;, &#39;Gloomy&#39;, &#39;Deep&#39;, &#39;Martian&#39;, &#39;Nuclear&#39;, &#39;Imperial&#39;, &#39;Plague&#39;, &#39;Space&#39;, &#39;Lotus&#39;, &#39;Cardinal&#39;, &#39;Velvet&#39;, &#39;Plasma&#39;, &#39;Magma&#39;, &#39;Lava&#39;, &#39;Metal&#39;, &#39;Metallic&#39;, &#39;Stone&#39;, &#39;Neutral&#39;, &#39;Concrete&#39;, &#39;Glacier&#39;, &#39;Glue&#39;, &#39;Crystal&#39;, &#39;Antishine&#39;, &#39;Enamel&#39;]
# define functions to split tags and generate new tags column
def split_tags(tags):
return [tag.strip() for tag in re.findall(r&#39;&quot;[^&quot;]+&quot;|\w+&#39;, tags)]
def generate_tags(row):
tags = []
if row[&#39;Type&#39;] == &#39;Paint&#39; and (&#39;Set&#39; in row[&#39;Title&#39;] or &#39;Sets&#39; in row[&#39;Title&#39;] or &#39;Briefcase&#39; in row[&#39;Title&#39;] or &#39;Case&#39; in row[&#39;Title&#39;]):
tags.append(&#39;Paint Set&#39;)
if row[&#39;Type&#39;] == &#39;Miniature&#39; and &#39;Booster&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Booster&#39;)
if row[&#39;Type&#39;] == &#39;Miniature&#39; and &#39;Icons of the Realm&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Painted&#39;)
tags.append(&#39;Icons of the Realm&#39;)
if &#39;D&amp;D&#39; in row[&#39;Title&#39;] or &#39;Dungeons &amp; Dragons&#39; in row[&#39;Title&#39;]:
tags.append(&#39;D&amp;D&#39;)
if &#39;Aerosol&#39; in row [&#39;Title&#39;] or &#39;Spray&#39; in row [&#39;Title&#39;]:
tags.append(&#39;Aerosol&#39;)
if &#39;Critical Role&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Critical Role&#39;)   
if &#39;Cyberpunk RED&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Cyberpunk RED&#39;)  
if &#39;Cyberpunk 2020&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Cyberpunk 2020&#39;)  
if &#39;Cyberpunk&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Cyberpunk&#39;)          
if &#39;Honor Among Thieves&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Honor Among Thieves&#39;)  
if &#39;Painted&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Painted&#39;)  
if &#39;Unpainted&#39; in row[&#39;Title&#39;] or &#39;Collectors Series&#39; in row [&#39;Title&#39;] or &#39;Nolzurs Marvelous Miniatures&#39; in row [&#39;Title&#39;] or &#39;Deep Cuts&#39; in row [&#39;Title&#39;]:
tags.append(&#39;Unpainted&#39;)  
if &#39;Vampire: The Masquerade&#39; in row[&#39;Title&#39;]:
tags.append(&#39;VTM&#39;)  
if &#39;Nolzurs Marvelous&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Nolzurs Marvelous&#39;)  
if &#39;Pre-Order&#39; in row[&#39;availability&#39;]:
tags.append(&#39;Pre-Order&#39;)  
if &#39;d6&#39; in row[&#39;Title&#39;] and &#39;Block&#39; in row [&#39;Title&#39;]:
tags.append(&#39;D6 Set&#39;)
if &#39;Dice Set 7&#39; in row[&#39;Title&#39;] or &#39;7-Die Set&#39; in row [&#39;Title&#39;]:
tags.append(&#39;RPG Set&#39;)
if &#39;battlemat&#39; in row[&#39;Title&#39;] or &#39;playmat&#39; in row[&#39;Title&#39;] or&#39;megamat&#39; in row[&#39;Title&#39;] or&#39;flip mat&#39; in row[&#39;Title&#39;] or&#39; mat&#39; in row[&#39;Title&#39;] or &#39;Flip-mat&#39; in row[&#39;Title&#39;] or &#39;Hobby Mat&#39; in row[&#39;Title&#39;] or &#39;Hobby Mat&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Pre-Order&#39;)
if &#39;Drybrush&#39; in row[&#39;Title&#39;] or &#39;Paintbrush&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Paint Brush&#39;)         
if &#39;Brush Set&#39; in row[&#39;Title&#39;] or &#39;Brushes&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Paint Brush&#39;)
tags.append(&#39;Brush Set&#39;)
if &#39;Pathfinder&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Pathfinder&#39;)
if &#39;Warhammer&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Warhammer&#39;)
if &#39;Idols of the Realm&#39; in row[&#39;Title&#39;]:
tags.append(&#39;2D&#39;)
tags.append(&#39;Idols of the Realm&#39;)
if &#39;Replicas of the Realm&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Replicas of the Realm&#39;)
tags.append(&#39;Collectable&#39;)
if &#39;Exandria&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Exandria&#39;)
if &#39;Dungeon Dressings&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Dungeon Dressings&#39;)
if &#39;Portal 2&#39; in row[&#39;Title&#39;]:
tags.append(&#39;Portal 2&#39;)
if row[&#39;Type&#39;] == &#39;Paint&#39;:
# Check for color in title
for color in paint_colors:
if color in row[&#39;Title&#39;]:
tags.append(color)
# Check for color in description
for color in paint_colors:
if color in row[&#39;Body HTML&#39;]:
tags.append(color)
return &quot;,&quot;.join(tags)
df_a[&#39;Tags&#39;] = df_a.apply(generate_tags, axis=1)
# Remove unwanted columns
df_a = df_a.drop([&#39;rrp&#39;, &#39;availability&#39;], axis=1)
# Drop rows with missing values in the &quot;Type&quot; column
df_a = df_a.dropna(subset=[&#39;Type&#39;])
# Import the new CSV as df_b
df_b = pd.read_csv(&#39;Products.csv&#39;)
# Select only the &#39;Handle&#39; column from df_b
df_b = df_b[[&#39;Handle&#39;]]
# Convert Handle column to string
df_a[&#39;Handle&#39;] = df_a[&#39;Handle&#39;].astype(str)
df_b[&#39;Handle&#39;] = df_b[&#39;Handle&#39;].astype(str)
# Add &quot;-single&quot; to the handle for rows with &quot;Booster&quot; or &quot;Brick&quot; in the title
mask = df_a[&#39;Title&#39;].str.contains(&#39;Booster Brick&#39;)
mask &amp;= ~df_a[&#39;Handle&#39;].str.endswith(&#39;-single&#39;) # Exclude rows with &#39;-single&#39; already present in the Handle
df_a.loc[mask, &#39;Handle&#39;] = df_a.loc[mask, &#39;Handle&#39;].astype(str) + &#39;-single&#39;
# Divide Prices of Boosters to single amounts. 
df_a.loc[df_a[&#39;Title&#39;].str.contains(&#39;Booster|Brick&#39;), [&#39;Variant Price&#39;, &#39;Variant Compare At Price&#39;, &#39;Variant Cost&#39;]] /= 8
# Filter out rows where &#39;Handle&#39; value is already in df_b
df_a = df_a[~df_a[&#39;Handle&#39;].isin(df_b[&#39;Handle&#39;])]
# Reorder the columns
df_a = df_a[[&#39;Handle&#39;,&#39;Title&#39;,&#39;Body HTML&#39;,&#39;Vendor&#39;,&#39;Type&#39;,&#39;Tags&#39;,&#39;Template Suffix&#39;,&#39;Variant SKU&#39;,&#39;Variant Barcode&#39;,&#39;Variant Weight&#39;,&#39;Variant Weight Unit&#39;,&#39;Variant Price&#39;,&#39;Variant Compare At Price&#39;,&#39;Variant Taxable&#39;,&#39;Variant Inventory Policy&#39;,&#39;Variant Requires Shipping&#39;,&#39;Variant Cost&#39;,&#39;Image Src&#39;]]
#Export New Products List
df_a.to_csv(&#39;NewProducts&#39; + filename, index=False)
# Creating the Update Product List
# Load the CSV files into a pandas DataFrame
df_a = pd.read_csv(&#39;stockfeed.csv&#39;)
df_b = pd.read_csv(&#39;Products.csv&#39;)[[&#39;ID&#39;, &#39;Handle&#39;]]
df_prod = pd.read_csv(&#39;Products.csv&#39;)[[&#39;ID&#39;, &#39;Handle&#39;, &#39;Variant Compare At Price&#39;, &#39;Template Suffix&#39;, &#39;Variant Inventory Policy&#39;, &#39;Variant Cost&#39;]]
# Keep only the required columns
df_a = df_a[[&#39;item_number&#39;, &#39;rrp&#39;, &#39;availability&#39;, &#39;price_ex_gst&#39;]]
# Rename the columns
df_a = df_a.rename(columns={&#39;item_number&#39;: &#39;Handle&#39;, &#39;price_ex_gst&#39;: &#39;Variant Cost&#39;})
# Create new columns
df_a[&#39;Variant Compare At Price&#39;] = round(df_a[&#39;rrp&#39;] / 1.1, 2)
df_a[&#39;Template Suffix&#39;] = df_a[&#39;availability&#39;].apply(lambda x: &#39;pre-order&#39; if x == &#39;Pre-Order&#39; else &#39;&#39;)
df_a[&#39;Variant Inventory Policy&#39;] = df_a[&#39;availability&#39;].apply(lambda x: &#39;continue&#39; if x == &#39;Pre-Order&#39; else &#39;deny&#39;)
# Remove unwanted columns
df_a = df_a.drop([&#39;rrp&#39;, &#39;availability&#39;], axis=1)
# Reorder the columns
df_a = df_a[[&#39;Handle&#39;, &#39;Variant Compare At Price&#39;, &#39;Template Suffix&#39;, &#39;Variant Inventory Policy&#39;, &#39;Variant Cost&#39;]]
# Convert Handle column in CSV A to string
df_a[&#39;Handle&#39;] = df_a[&#39;Handle&#39;].astype(str)
# Merge CSV files on Handle column
df_merged = pd.merge(df_a, df_b, on=&#39;Handle&#39;)
# Reorder the columns
df_merged = df_merged[[&#39;ID&#39;, &#39;Handle&#39;, &#39;Variant Compare At Price&#39;, &#39;Template Suffix&#39;, &#39;Variant Inventory Policy&#39;, &#39;Variant Cost&#39;]]
# Round Variant Compare At Price to 2 decimal places
df_merged[&#39;Variant Compare At Price&#39;] = df_merged[&#39;Variant Compare At Price&#39;].round(2)
# Export merged CSV file
df_merged.to_csv(&#39;ChangedProducts&#39; + filename, index=False)
# Read in the Handle columns from df_b and df_merged
df_c = df_b[[&#39;Handle&#39;]]
df_d = df_merged[[&#39;Handle&#39;]]
# Identify discontinued Handles
df_discon = df_c[~df_c[&#39;Handle&#39;].isin(df_d[&#39;Handle&#39;])]
# Export discontinued Handles to CSV
df_discon.to_csv(&#39;UnmatchedProducts&#39; + filename, index=False)
# start filter test 
# create a list of the handles in df_discon
discon_handles = df_discon[&#39;Handle&#39;].tolist()
# remove rows from df_prod that have handles in discon_handles
df_prod_filtered = df_prod[~df_prod[&#39;Handle&#39;].isin(discon_handles)]
df_prod_filtered = df_prod_filtered.reset_index(drop=True)
# Print column headings, number of rows, and column types for df_merged
print(&quot;df_merged&quot;)
print(&quot;Columns: &quot;, df_merged.columns.tolist())
print(&quot;Index column of df_merged: &quot;, df_merged.index.name)
print(&quot;Number of rows: &quot;, len(df_merged))
print(&quot;Column types: \n&quot;, df_merged.dtypes)
# Print column headings, number of rows, and column types for df_prod_filtered
print(&quot;\ndf_prod_filtered&quot;)
print(&quot;Columns: &quot;, df_prod_filtered.columns.tolist())
print(&quot;Index column of df_prod_filtered: &quot;, df_prod_filtered.index.name)
print(&quot;Number of rows: &quot;, len(df_prod_filtered))
print(&quot;Column types: \n&quot;, df_prod_filtered.dtypes)
# create a mask by comparing the two dataframes based on their index (ID column)
mask = df_merged.eq(df_prod_filtered.loc[df_merged.index])
# create a new column in df_merged that shows if there is a difference or not
df_merged[&#39;Diff&#39;] = ~mask.all(axis=1)
#print(df_merged)
# Print only rows where Diff is False and a string
#print(df_merged[df_merged[&quot;Diff&quot;] == &#39;False&#39;])
# Print only rows where Diff is False as a boolean
##df_merged = df_merged[df_merged[&quot;Diff&quot;] != True]
#print(df_merged)
# end Filter test

答案1

得分: 1

Ok! I figured this out and it turns out that probably many of the methods I had already tried may have resolved this issue.

The problem I was running into, without realizing it, is that the pandas considered NaN values different to other NaN values. that was it.

by replacing NaN values with a Placeholder string, I was able to create the dataframe with just the updated entries, and then replace the Placeholder with NaN values again using numpy.

I have yet to insert the new code into the overall script, but the specific goal/problem from this question is resolved, for reference, here is the script I created in testing and resolving this one:

import pandas as pd
import numpy as np

# Load the UpdateProducts and base product list dataframes
df_upd = pd.read_csv('ChangedProducts.csv')
df_prod = pd.read_csv('Products.csv')

df_prod = df_prod[['ID', 'Variant Compare At Price', 'Template Suffix', 'Variant Inventory Policy', 'Variant Cost']]
df_upd = df_upd[['ID', 'Variant Compare At Price', 'Template Suffix', 'Variant Inventory Policy', 'Variant Cost']]

# set index to ID column
df_prod.set_index('ID', inplace=True)
df_upd.set index('ID', inplace=True)

# replace nan values with placeholder
df_upd = df_upd.fillna('EMPTYCELLPLACEHOLDER')
df_prod = df_prod.fillna('EMPTYCELLPLACEHOLDER')

# compare the two dataframes element-wise
comparison = df_upd.eq(df_prod)

# get the rows where all values are True
all_same_rows = comparison.all(axis=1)

# get only the rows that are different
different_rows = df_upd[~all_same_rows]

# replace the placeholder
different_rows = different_rows.replace('EMPTYCELLPLACEHOLDER', np.nan)

# save the different rows to a new CSV file
different_rows.to_csv('DifferentRows.csv', index=True)
英文:

Ok! I figured this out and it turns out that probably many of the methods I had already tried may have resolved this issue.

The problem I was running into, without realizing it, is that the pandas considered NaN values different to other NaN values. that was it.

by replacing NaN values with a Placeholder string, I was able to create the dataframe with just the updated entries, and then replace the Placeholder with NaN values again using numpy.

I have yet to insert the new code into the overall script, but the specific goal/problem from this question is resolved, for refence, here is the script I created in testing and resolving this one:

import pandas as pd
import numpy as np
# Load the UpdateProducts and base product list dataframes
df_upd = pd.read_csv(&#39;ChangedProducts.csv&#39;)
df_prod = pd.read_csv(&#39;Products.csv&#39;)
df_prod = df_prod[[&#39;ID&#39;, &#39;Variant Compare At Price&#39;, &#39;Template Suffix&#39;, &#39;Variant Inventory Policy&#39;, &#39;Variant Cost&#39;]]
df_upd = df_upd[[&#39;ID&#39;, &#39;Variant Compare At Price&#39;, &#39;Template Suffix&#39;, &#39;Variant Inventory Policy&#39;, &#39;Variant Cost&#39;]]
# set index to ID column
df_prod.set_index(&#39;ID&#39;, inplace=True)
df_upd.set_index(&#39;ID&#39;, inplace=True)
# replace nan values with placeholder
df_upd = df_upd.fillna(&#39;EMPTYCELLPLACEHOLDER&#39;)
df_prod = df_prod.fillna(&#39;EMPTYCELLPLACEHOLDER&#39;)
# compare the two dataframes element-wise
comparison = df_upd.eq(df_prod)
# get the rows where all values are True
all_same_rows = comparison.all(axis=1)
# get only the rows that are different
different_rows = df_upd[~all_same_rows]
# replace the placeholder
different_rows = different_rows.replace(&#39;EMPTYCELLPLACEHOLDER&#39;, np.nan)
# save the different rows to a new CSV file
different_rows.to_csv(&#39;DifferentRows.csv&#39;, index=True)

huangapple
  • 本文由 发表于 2023年4月4日 11:19:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/75925265.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定