Python | Machine Learning | Coding

Python | Machine Learning | Coding | R

Topic: Handling Datasets of All Types – Part 1 of 5: Introduction and Basic Concepts

---

1. What is a Dataset?

• A dataset is a structured collection of data, usually organized in rows and columns, used for analysis or training machine learning models.

---

2. Types of Datasets

• Structured Data: Tables, spreadsheets with rows and columns (e.g., CSV, Excel).

• Unstructured Data: Images, text, audio, video.

• Semi-structured Data: JSON, XML files containing hierarchical data.

---

3. Common Dataset Formats

• CSV (Comma-Separated Values)

• Excel (.xls, .xlsx)

• JSON (JavaScript Object Notation)

• XML (eXtensible Markup Language)

• Images (JPEG, PNG, TIFF)

• Audio (WAV, MP3)

---

4. Loading Datasets in Python

• Use libraries like pandas for structured data:

import pandas as pd
df = pd.read_csv('data.csv')

• Use libraries like json for JSON files:

import json
with open('data.json') as f:
    data = json.load(f)

---

5. Basic Dataset Exploration

• Check shape and size:

print(df.shape)

• Preview data:

print(df.head())

• Check for missing values:

print(df.isnull().sum())

---

6. Summary

• Understanding dataset types is crucial before processing.

• Loading and exploring datasets helps identify cleaning and preprocessing needs.

---

Exercise

• Load a CSV and JSON dataset in Python, print their shapes, and identify missing values.

---

#DataScience #Datasets #DataLoading #Python #DataExploration

The rest of the parts 👇
https://t.me/DataScienceM

🌟

Please open Telegram to view this post

VIEW IN TELEGRAM

❤27👍1

8.57K viewsedited 12:11

Python | Machine Learning | Coding | R

Top 100 Data Analysis Commands & Functions

#DataAnalysis #Pandas #DataLoading #Inspection

Part 1: Pandas - Data Loading & Inspection

#1. pd.read_csv()
Reads a comma-separated values (csv) file into a Pandas DataFrame.

import pandas as pd
from io import StringIO

csv_data = "col1,col2,col3\n1,a,True\n2,b,False"
df = pd.read_csv(StringIO(csv_data))
print(df)

col1 col2   col3
0     1    a   True
1     2    b  False

#2. df.head()
Returns the first n rows of the DataFrame (default is 5).

import pandas as pd
df = pd.DataFrame({'A': range(10), 'B': list('abcdefghij')})
print(df.head(3))

#3. df.tail()
Returns the last n rows of theDataFrame (default is 5).

import pandas as pd
df = pd.DataFrame({'A': range(10), 'B': list('abcdefghij')})
print(df.tail(3))

#4. df.info()
Prints a concise summary of a DataFrame, including data types and non-null values.

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': ['x', 'y', 'z']})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       2 non-null      float64
 1   B       3 non-null      object 
dtypes: float64(1), object(1)
memory usage: 176.0+ bytes

#5. df.describe()
Generates descriptive statistics for numerical columns.

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
print(df.describe())

A
count  5.000000
mean   3.000000
std    1.581139
min    1.000000
25%    2.000000
50%    3.000000
75%    4.000000
max    5.000000

#6. df.shape
Returns a tuple representing the dimensionality (rows, columns) of the DataFrame.

import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
print(df.shape)

(2, 3)

#7. df.columns
Returns the column labels of the DataFrame.

import pandas as pd
df = pd.DataFrame({'Name': ['Alice'], 'Age': [30]})
print(df.columns)

Index(['Name', 'Age'], dtype='object')

#8. df.dtypes
Returns the data types of each column.

import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [1.1, 2.2], 'C': ['x', 'y']})
print(df.dtypes)

A      int64
B    float64
C     object
dtype: object

#9. df['col'].value_counts()
Returns a Series containing counts of unique values in a column.

import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana', 'Apple']})
print(df['Fruit'].value_counts())

Apple     3
Banana    2
Orange    1
Name: Fruit, dtype: int64

#10. df['col'].unique()
Returns an array of the unique values in a column.

import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Apple', 'Orange']})
print(df['Fruit'].unique())

['Apple' 'Banana' 'Orange']

#11. df['col'].nunique()
Returns the number of unique values in a column.

❤2

979 views17:02

About

Blog

Apps

Platform