Topic: Handling Datasets of All Types – Part 1 of 5: Introduction and Basic Concepts
---
1. What is a Dataset?
• A dataset is a collection of related data, often organized in rows and columns, that is used for analysis or for training machine learning models.
---
2. Types of Datasets
• Structured Data: Tables, spreadsheets with rows and columns (e.g., CSV, Excel).
• Unstructured Data: Images, text, audio, video.
• Semi-structured Data: JSON, XML files containing hierarchical data.
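To make "hierarchical" concrete, here is a minimal sketch that reads the same made-up customer record once as JSON and once as XML, using only the Python standard library. The record, its field names, and the variable names are assumptions chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: one customer with a nested list of orders.
json_text = '{"name": "Alice", "orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 20.0}]}'
xml_text = """
<customer name="Alice">
  <order id="1" total="9.5"/>
  <order id="2" total="20.0"/>
</customer>
""".strip()

# JSON parses into nested dicts and lists.
record = json.loads(json_text)
json_totals = [order["total"] for order in record["orders"]]

# XML parses into a tree of elements; attribute values are strings.
root = ET.fromstring(xml_text)
xml_totals = [float(order.get("total")) for order in root.findall("order")]

print(json_totals)
print(xml_totals)
```

Both formats carry the same nested structure, which is why they usually need flattening before they fit a rows-and-columns table.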
---
3. Common Dataset Formats
• CSV (Comma-Separated Values)
• Excel (.xls, .xlsx)
• JSON (JavaScript Object Notation)
• XML (eXtensible Markup Language)
• Images (JPEG, PNG, TIFF)
• Audio (WAV, MP3)
---
4. Loading Datasets in Python
• Use libraries like pandas for structured data:
    import pandas as pd
    df = pd.read_csv('data.csv')
• Use the built-in json module for JSON files:
    import json
    with open('data.json') as f:
        data = json.load(f)
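Note that json.load returns plain Python dicts and lists, not a DataFrame. One possible way to get a table is pd.json_normalize, sketched below. The in-memory StringIO objects stand in for data.csv and data.json, and their contents are invented for this example.

```python
import json
from io import StringIO

import pandas as pd

# In-memory stand-ins for data.csv and data.json (hypothetical contents).
csv_file = StringIO("id,name,score\n1,Alice,90\n2,Bob,85\n")
json_file = StringIO(
    '[{"id": 1, "name": "Alice", "address": {"city": "Paris"}},'
    ' {"id": 2, "name": "Bob", "address": {"city": "Oslo"}}]'
)

df_csv = pd.read_csv(csv_file)

records = json.load(json_file)        # plain Python list of dicts
df_json = pd.json_normalize(records)  # nested keys become dotted column names

print(df_csv.shape)
print(list(df_json.columns))
```

With real files, pass the file path to pd.read_csv and open the JSON file with open() as in the bullets above.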
---
5. Basic Dataset Exploration
• Check shape and size:
    print(df.shape)
• Preview data:
    print(df.head())
• Check for missing values:
    print(df.isnull().sum())
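The three checks above can be tried on a small hand-made DataFrame. This is only a sketch: the column names and values are made up, and the last step (share of missing values per column) is an optional extra, not part of the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps (np.nan and None both count as missing).
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Paris", "Oslo", None, "Rome"],
    "score": [88.0, 92.5, np.nan, np.nan],
})

n_rows, n_cols = df.shape                # (rows, columns)
preview = df.head(2)                     # first two rows
missing_per_column = df.isnull().sum()   # count of missing values per column
missing_share = df.isnull().mean()       # fraction of missing values per column

print(n_rows, n_cols)
print(preview)
print(missing_per_column)
print(missing_share)
```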
---
6. Summary
• Understanding dataset types is crucial before processing.
• Loading and exploring datasets helps identify cleaning and preprocessing needs.
---
Exercise
• Load a CSV and a JSON dataset in Python (converting the JSON data to a DataFrame), print their shapes, and identify missing values.
---
#DataScience #Datasets #DataLoading #Python #DataExploration
The rest of the parts
https://t.me/DataScienceM
Top 100 Data Analysis Commands & Functions
#DataAnalysis #Pandas #DataLoading #Inspection
Part 1: Pandas - Data Loading & Inspection
#1. pd.read_csv()
Reads a comma-separated values (CSV) file into a Pandas DataFrame.
    import pandas as pd
    from io import StringIO
    csv_data = "col1,col2,col3\n1,a,True\n2,b,False"
    df = pd.read_csv(StringIO(csv_data))
    print(df)
Output:
       col1 col2   col3
    0     1    a   True
    1     2    b  False
#2. df.head()
Returns the first n rows of the DataFrame (default is 5).
    import pandas as pd
    df = pd.DataFrame({'A': range(10), 'B': list('abcdefghij')})
    print(df.head(3))
Output:
       A  B
    0  0  a
    1  1  b
    2  2  c
#3. df.tail()
Returns the last n rows of the DataFrame (default is 5).
    import pandas as pd
    df = pd.DataFrame({'A': range(10), 'B': list('abcdefghij')})
    print(df.tail(3))
Output:
       A  B
    7  7  h
    8  8  i
    9  9  j
#4. df.info()
Prints a concise summary of a DataFrame, including data types and non-null values.
    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'A': [1, 2, np.nan], 'B': ['x', 'y', 'z']})
    df.info()
Output:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 2 columns):
     #   Column  Non-Null Count  Dtype
    ---  ------  --------------  -----
     0   A       2 non-null      float64
     1   B       3 non-null      object
    dtypes: float64(1), object(1)
    memory usage: 176.0+ bytes
#5. df.describe()
Generates descriptive statistics for numerical columns.
    import pandas as pd
    df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
    print(df.describe())
Output:
                  A
    count  5.000000
    mean   3.000000
    std    1.581139
    min    1.000000
    25%    2.000000
    50%    3.000000
    75%    4.000000
    max    5.000000
#6. df.shape
Returns a tuple representing the dimensionality (rows, columns) of the DataFrame.
    import pandas as pd
    df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
    print(df.shape)
Output:
    (2, 3)
#7. df.columns
Returns the column labels of the DataFrame.
    import pandas as pd
    df = pd.DataFrame({'Name': ['Alice'], 'Age': [30]})
    print(df.columns)
Output:
    Index(['Name', 'Age'], dtype='object')
#8. df.dtypes
Returns the data types of each column.
    import pandas as pd
    df = pd.DataFrame({'A': [1, 2], 'B': [1.1, 2.2], 'C': ['x', 'y']})
    print(df.dtypes)
Output:
    A      int64
    B    float64
    C     object
    dtype: object
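#9. df['col'].value_counts()
Returns a Series containing counts of unique values in a column, sorted from most to least frequent.
    import pandas as pd
    df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana', 'Apple']})
    print(df['Fruit'].value_counts())
Output:
    Apple     3
    Banana    2
    Orange    1
    Name: Fruit, dtype: int64
Note: in pandas 2.0 and later, the result is named "count" and the index carries the column name, so the output starts with a "Fruit" line and ends with "Name: count, dtype: int64".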
#10. df['col'].unique()
Returns an array of the unique values in a column.
    import pandas as pd
    df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Apple', 'Orange']})
    print(df['Fruit'].unique())
Output:
    ['Apple' 'Banana' 'Orange']
#11. df['col'].nunique()
Returns the number of unique values in a column.
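The original post gives no example for this entry, so the following is a small sketch in the style of the earlier ones. The Fruit data is made up, and the dropna=False call is added only to show how missing values are treated.

```python
import pandas as pd

# Hypothetical column with one repeated value and one missing value.
df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Apple', 'Orange', None]})

n_unique = df['Fruit'].nunique()                      # missing values are ignored by default
n_unique_with_na = df['Fruit'].nunique(dropna=False)  # count the missing value as its own category

print(n_unique)
print(n_unique_with_na)
```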